Excel2003XML-Reader failures

Topics: Developer Forum
Jan 10, 2011 at 9:41 AM

I just got a couple of Excel2003 files and tried to read them with PHPExcel (1.7.5), which failed.

The first thing I found was the format detection in canRead(): "my" files had the signature

<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>

which is definitely correct, but does not fit to the existing code. So I changed canRead() to

        public function canRead($pFilename)
        {

//      Office                                  xmlns:o="urn:schemas-microsoft-com:office:office"
//      Excel                                   xmlns:x="urn:schemas-microsoft-com:office:excel"
//      XML Spreadsheet                 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
//      Spreadsheet component   xmlns:c="urn:schemas-microsoft-com:office:component:spreadsheet"
//      XML schema                              xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882"
//      XML data type                   xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
//      MS-persist recordset    xmlns:rs="urn:schemas-microsoft-com:rowset"
//      Rowset                                  xmlns:z="#RowsetSchema"
//

                $signature = array(
                                '<?xml version="1.0"',
                                '<?mso-application progid="Excel.Sheet"?>'
                        );

                // Check if file exists
                if (!file_exists($pFilename)) {
                        throw new Exception("Could not open " . $pFilename . " for reading! File does not exist.");
                }

                // Read sample data (first 2 KB will do)
                $fh = fopen($pFilename, 'r');
                $data = fread($fh, 2048);
                fclose($fh);

                $valid = true;
                foreach($signature as $match) {
                        // every part of the signature must be present
                        if (strpos($data, $match) === false) {
                                $valid = false;
                                break;
                        }
                }

                return $valid;
        }

which does the trick for me (though I don't know if there might be other differences, too).

After that I ran into a namespace problem. "My" excel file uses <ss:Worksheet ...></ss:Worksheet>, but in Excel2003XML.php it says

foreach($xml->Worksheet as $worksheet) { ... }

which does only address child elements without a namespace (SimpleXML is kind of a PITA as soon as namespaces are involved). Changing this to

$children = $xml->children($namespaces['ss']);
foreach($children->Worksheet as $worksheet) {
was enough to fix it for me - however, this relies on the nickname "ss" for the spreadsheet-namespace, which should be fine in most cases, but is not guaranteed by any specification (I just wonder why this was missing in the code - I have not managed to get a single Excel2003-file without the ss-namespace for the Worksheet-tag).

Coordinator
Jan 10, 2011 at 10:26 AM
Edited Jan 10, 2011 at 10:50 AM

I'll modify the signature test. My original test workbooks didn't have encoding specified, hence the signature that I was checking for. I'll make this more generic, and put in a charset conversion (with appropriate default if no encoding is specified, to ensure that data is always maintained as UTF-8.

You might want to try using the latest SVN code for the actual load() mathod, because I have made some changes to this reader since the 1.7.5 release, for handling the ss namespace. I'm unsure about whether the ss namespace is always enforced (I suspect not, otherwise the original Reader would never have worked), but it certainly appears in all the XML workbooks I've saved recently. But, I'll see about modifying the code to handle either case.

I just wish the format was consistent

Jan 10, 2011 at 4:11 PM

Ah, yes, the second issue is in fact already solved in the trunk. (I'd guess, the nick "ss" is consistenly used for the namespace, but... well, you can't blame anyone if he does not, that's the problem). AFAIK the correct way should be to call the namespaces by their URI, e.g.

 

$namespaces['ss'] = "urn:schemas-microsoft-com:office:spreadsheet";
$namespaces['c'] = "...";
[...]
$xml_ss = $xml->children($namespaces['ss']);
foreach($xml_ss->Worksheet as $worksheet) { } 


which would make you independent of any nicknames (I haven't tried this, though). It's basically a similar precaution than adding the charset conversion, because even though sometimes the encoding is explicitely specified by Excel I have never seen anything else than UTF-8.

Thanks for taking care about this anyway!