PHPExcel 1.7.8 can ALMOST parse a broken HTML .xls

Mar 11, 2013 at 6:09 PM
Edited Mar 11, 2013 at 7:28 PM
I'm using PHP 5.3.10 and PHPExcel 1.7.8 along with XDebug to try to figure out what's going on here.

I'm trying to parse XLS files uploaded to my web app "A" that are created/generated by someone else's web app "B". (B offers no API...)

What I'm seeing when I step through with XDebug is that:
  1. .xls file is identified as Excel (BIFF) Spreadsheet and class
  2. The PHPExcel_Reader_Excel5::canRead($pFilename) is called
  3. PHPExcel_Shared_OLERead::read($sFilename) is called
  4. The file is not an OLE file (the check runs inside a try/catch), so PHPExcel_Reader_Excel5 can't read it
  5. IOFactory::createReaderForFile tries every other reader, and PHPExcel_Reader_HTML is the only one that can read it
  6. PHPExcel_Reader_HTML fails inside $dom->loadHTMLFile because of a broken <td> tag! The actual error thrown is WARNING: DOMDocument::loadHTMLFile(): Unexpected end tag : td in /path/to/my/xls/file.xls
What I found out is that by changing line 458 of PHPExcel/Reader/HTML.php from...
$loaded = $dom->loadHTMLFile($pFilename);
...to...
$loaded = @$dom->loadHTMLFile($pFilename);
...the class is able to continue. I know error suppression is typically the worst way to deal with code problems, but in this case it actually allowed the class to return data. Maybe this will help someone in the future. I should mention that this is a warning, not a fatal error; however, with the settings in my framework (Kohana 3.x), warnings are show stoppers.
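
For anyone who would rather not use @, a possible alternative is to have libxml collect its parse errors instead of raising PHP warnings. This is only a minimal sketch, not what PHPExcel does: the $html string is a made-up sample with a stray closing </td>, and it uses loadHTML on a string so it runs on its own (loadHTMLFile reports parse errors through the same libxml mechanism).

```php
<?php
// Made-up sample input: a table with a stray closing </td>.
$html = '<html><body><table><tr><td>42</td></td></tr></table></body></html>';

// Ask libxml to store parse errors internally instead of raising PHP warnings.
// The call returns the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Fetch whatever the parser complained about, then empty libxml's error buffer.
$parseErrors = libxml_get_errors();
libxml_clear_errors();

// Restore the earlier setting so unrelated code keeps its behaviour.
libxml_use_internal_errors($previous);

// Each entry is a LibXMLError object carrying a message and a line number.
$messages = array();
foreach ($parseErrors as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// The document is still usable despite the malformed markup.
$cells = $dom->getElementsByTagName('td');
$firstCell = $cells->length > 0 ? $cells->item(0)->textContent : null;
```

The upside over @ is that the parser's complaints end up in $messages, so they can be logged or inspected rather than silently thrown away.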

The code maintainers may want to wrap this call in a try/catch: $dom->loadHTMLFile($pFilename), because ultimately it just does fopen & fread later on down anyway.
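
One caveat on the try/catch idea: a plain try/catch does not intercept a PHP warning by itself. It only helps when something converts warnings into exceptions, which, as far as I can tell, is what my framework's error handling does. Below is a rough sketch of that mechanism using only built-in PHP functions. The loadHtmlStrict helper is hypothetical and not part of PHPExcel; returning null on failure is just one possible choice.

```php
<?php
/**
 * Hypothetical helper (not part of PHPExcel): loads an HTML string and
 * returns the DOMDocument, or null if the parser raised any PHP error.
 */
function loadHtmlStrict($html)
{
    // Temporarily turn every PHP error raised in this scope into an ErrorException.
    set_error_handler(function ($severity, $message, $file, $line) {
        throw new ErrorException($message, 0, $severity, $file, $line);
    });

    try {
        $dom = new DOMDocument();
        $dom->loadHTML($html);
    } catch (ErrorException $e) {
        // Put the previous handler back before reporting the failure.
        restore_error_handler();
        return null;
    }

    // Put the previous handler back on the success path as well.
    restore_error_handler();
    return $dom;
}

$document = loadHtmlStrict('<html><body><p>fine</p></body></html>');
```

The code avoids finally on purpose so that it still runs on PHP 5.3.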

Also, lines 472-473 of PHPExcel/Reader/HTML.php include debugging output that should be commented out or removed:
echo '<hr />';
and, internally, the private _processDomElement method of the PHPExcel_Reader_HTML class has tons of echo statements in it, which makes the class much less useful as library code for parsing this particular type of file inside another application. I've added a
public $suppressOutput = true;
to that class, and preceded all echo statements in the class with
if( ! $this->suppressOutput ) echo ...
so that I can get the return array from parsing my HTML Excel file without printing anything to the output buffer.
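
To show the idea in isolation, here is a small self-contained sketch of the same flag pattern. NoisyReader, its debug() helper and parseCells() are hypothetical stand-ins, not the actual PHPExcel_Reader_HTML code; only the $suppressOutput property name matches my change.

```php
<?php
// Hypothetical stand-in for a reader with leftover debug output; not PHPExcel source.
class NoisyReader
{
    /** When true, all debug output is skipped. */
    public $suppressOutput = true;

    /** Prints debug text only when output is not suppressed. */
    private function debug($text)
    {
        if (!$this->suppressOutput) {
            echo $text;
        }
    }

    /** Returns the trimmed cell values, emitting one debug marker per cell. */
    public function parseCells(array $cells)
    {
        $result = array();
        foreach ($cells as $cell) {
            $this->debug('<hr />');
            $result[] = trim($cell);
        }
        return $result;
    }
}

$reader = new NoisyReader();

// Capture anything the reader prints so it can be compared with the flag on and off.
ob_start();
$rows = $reader->parseCells(array(' a ', 'b'));
$printed = ob_get_clean();
```

Routing every echo through a single helper keeps the check in one place instead of repeating the if before each statement.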

I can submit a patch if you like.