listWorksheetInfo() updated for XMLReader

Topics: Developer Forum
Jan 5, 2013 at 10:26 PM
Edited Jan 5, 2013 at 11:02 PM

Like many others, I'm working in a memory-constrained environment, and have a file which I have no control over that reputedly has ~50000 columns and ~10000 rows in one worksheet, however the cells with actual data are under 1000x1000 -- the rest is empty formatting.

The xlsx file size is a little under 10MB, the XML for one of the sheets is ~60MB (in the xlsx archive), and when simplexml tries to load it for the worksheet info, the php process needs an incremental 1GB+ of ram.

Fortunately I don't need this particular worksheet, but it does stop me from using the stock listWorksheetInfo function.  Looking at MarkBaker's 2013 roadmap, he was kind enough to mention that XMLReader handles XML in chunks, and so I hacked up a quick solution for the Excel2007 reader to use it instead of simplexml for the worksheet files (the metadata files are so small I didn't bother for them).

The end result with XMLReader was a 95% reduction in incremental RAM use and a 10x++ increase in speed in my use case.

Unfortunately the code insert function on the site doesn't seem to be working for this block below, but if anyone else wants to implement this stopgap in 1.7.8, replace the code starting at line 311 and ending at line 331 in Classes/PHPExcel/Reader/Excel2007.php with the following (first and last lines should be the same for your reference):

$fileWorksheet = $worksheets[(string) self::array_item($eleSheet->attributes("http://schemas.openxmlformats.org/officeDocument/2006/relationships"), "id")];

$xml = new XMLReader();
$res = $xml->xml($this->_getFromZipArchive($zip, "$dir/$fileWorksheet"));
$xml->setParserProperty(2,true);

$currCells = 0;
while ($xml->read()) {
    switch ($xml->name) {
    case "row":
        $xml->read();
        $tmpInfo['totalRows']++;
        $tmpInfo['totalColumns'] = max($tmpInfo['totalColumns'],$currCells);
        $currCells = 0;
        $xml->read();
        break;
    case "c":
        $xml->read();
        $currCells++;
        $xml->read();
        break;
    }
}
$xml->close();


$tmpInfo['lastColumnIndex'] = $tmpInfo['totalColumns'] - 1;
$tmpInfo['lastColumnLetter'] = PHPExcel_Cell::stringFromColumnIndex($tmpInfo['lastColumnIndex']); // note, this doesn't seem to work for certain (all?) very large column counts

$worksheetInfo[] = $tmpInfo;



Coordinator
Jan 6, 2013 at 11:06 PM

Unfortunately, the code as it stands won't produce the correct results; you'll get 2 c entries for every column, and 2 row entries for every row, because you'll get both the opening and the closing tags; .... the listWorksheetInfo() code that I've been testing to work with xmlReader streaming uses:

while ($xml->read()) {
    if ($xml->name == 'row' && $xml->nodeType == XMLReader::ELEMENT) {
        $tmpInfo['totalRows']++;
        $tmpInfo['totalColumns'] = max($tmpInfo['totalColumns'],$currCells);
        $currCells = 0;
    } elseif ($xml->name == 'c' && $xml->nodeType == XMLReader::ELEMENT) {
        $currCells++;
    }
}

Generally XMLReader is more memory efficient, but slower. In the case of listWorksheetInfo(), it's marginally faster than the current code because it's looping through the data in a single pass rather than the nested loops of the simpleXML method. With a test file containing 2 worksheets of 16370 rows by 9 columns, it's about 7.5-8 seconds rather than 10, and uses 1.25MB (peak at 7.5MB) rather than 2.5MB (peak at 8.75).

XMLReader streaming should povide a bigger peak memory saving, and is a lot faster than $xml->xml() to load the xml data to the reader resource, as shown in the thread http://phpexcel.codeplex.com/discussions/242712... this is still targetted for the 1.7.9 release sometime in the next couple of months.