Identifying CSV files.

Topics: Developer Forum, User Forum
May 3, 2013 at 9:29 PM
Edited May 3, 2013 at 9:39 PM
Mark,

I'm back again testing PHPExcel 1.7.8 for a new project I'm working on and found an odd thing with the IOFactory::identify and IOFactory::createReaderForFile methods. They will not recognize a csv file. They both use the following code:
$pathinfo = pathinfo($pFilename);

$extensionType = NULL;
if (isset($pathinfo['extension'])) {
  switch (strtolower($pathinfo['extension'])) {
    case 'xlsx':      //  Excel (OfficeOpenXML) Spreadsheet
    case 'xlsm':      //  Excel (OfficeOpenXML) Macro Spreadsheet (macros will be discarded)
    case 'xltx':      //  Excel (OfficeOpenXML) Template
    case 'xltm':      //  Excel (OfficeOpenXML) Macro Template (macros will be discarded)
      $extensionType = 'Excel2007';
      break;
    case 'xls':        //  Excel (BIFF) Spreadsheet
    case 'xlt':        //  Excel (BIFF) Template
      $extensionType = 'Excel5';
      break;
    case 'ods':        //  Open/Libre Offic Calc
    case 'ots':        //  Open/Libre Offic Calc Template
      $extensionType = 'OOCalc';
      break;
    case 'slk':
      $extensionType = 'SYLK';
      break;
    case 'xml':        //  Excel 2003 SpreadSheetML
      $extensionType = 'Excel2003XML';
      break;
    case 'gnumeric':
      $extensionType = 'Gnumeric';
      break;
    case 'htm':
    case 'html':
      $extensionType = 'HTML';
      break;
    case 'csv':
      // Do nothing
      // We must not try to use CSV reader since it loads
      // all files including Excel files etc.
      break;
    default:
      break;
}
This makes the identification of files pretty much useless to me.
Shouldn't the createReader('CSV') routine throw an exception if the file isn't a valid CSV file?

Anyway, the interesting thing about my tests is that one file with a csv extension fails with an 'unknown file format' error (which is expected with this code), but another file with a csv extension fails because the file was identified as an HTML file (which it isn't) and threw an exception trying to access the listWorksheetInfo method (which does not exist in the HTML reader). Any idea why my one file might be identified as an HTML file when the extension is 'csv'?
  • Christopher Mullins
Coordinator
May 6, 2013 at 7:15 PM
It's probably the next part of the identification code.... if the file "signature" matches the extension (csv excluded) then that will be the recognised filetype, otherwise the IOFactory loops through each supported type in turn trying to identify a signature that it recognises, which determines the file reader that will be used irrespective of the file extension. The fallback is to treat any file that hasn't been unidentified during this test loop as a CSV: the CSV Reader will try to process an file as a CSV because there currently isn't any checking in the canRead method for a CSV file.

As for a csv being recognised as an HTML file, that's probably because the HTML signature check is too simplistic. I probably need to tighten it up and use dom as part of that canRead test; though it still needs to allow for badly formed or partial html being recognised correctly
May 7, 2013 at 2:09 AM
Edited May 8, 2013 at 8:09 PM
Mark,

I've made some customization to my code that allows it to identify both HTML files and CSV files.

CSV identification uses two underlying rules:
  • No blank rows. If blank rows are detected then file is an invalid CSV file.
  • A majority of the rows must have the same number of fields otherwise file is an invalid CSV file.
Would you be interested in adding these modifications into the existing code?

If so, I can paste the the changes here. They are actually relatively minor changes.

Files modified:
  • IOFactory.php - Added ability to pass options to identify and createReaderForFile methods.
  • HTML.php - Added code to canRead method to more accurately test for a valid HTML file in reader.
  • CSV.php - Added code to canRead method to test for a valid CSV file in reader.
The reader modifications add some overhead and time, but only when testing for a valid file so hopefully it is not too onerous.

Let me know if you need me to paste those changes.

-- Addendum: I've just been looking over your changes to the canRead methods in the repository. Looks cleaner and more modular for flexibility. I'll just keep my custom code in place until the next PHPExcel version comes out with these fixes. I'll still have to add my code for the csvReader::_isValidFormat() method in order to trim down on invalid CSV files.

Christopher Mullins