searching files very slow

Topics: User Forum
Sep 22, 2014 at 6:26 PM
Hello!

I created a simple program which is supposed to search 'query' in the first column of all .xls or .xlsx files in a given 'directory' and then return each row (6 cells) where 1st cell is matching 'query', grouping them as a table.

This program actually works as it was supposed, but it's VERY slow to be usable. The script is working about 400 seconds and more, also it takes a lot of memory - up to several hundreds of Mb for some large files. In the 'directory' I have 38 .xls and .xlsx files each up to 5 Mb of size (~ up to 120 000 rows).

So my question: is it possible to speed up somehow the program (maybe the algorythm in general or the way I use PHPExcel) to consume some reasonable amount of time (several seconds)? Would be ideal to speed it up to 1 second or less. Thank you in advance!

My code:
<?php
/** Error reporting */
error_reporting(E_ALL);
ini_set('display_errors', TRUE); 
ini_set('display_startup_errors', TRUE); 

define('EOL',(PHP_SAPI == 'cli') ? PHP_EOL : '<br />');

/** Include PHPExcel */
require_once dirname(__FILE__) . '/../../PHPExcel/Classes/PHPExcel.php';

date_default_timezone_set('UTC');

if(isset($_POST['query']) && trim($_POST['query']) != '')
{
    $dir    = dirname(__FILE__);
    $files = scandir($dir);
    $found = False;
    
    print "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"><title>Test search</title>";
    print "<table border=1>\n";
    
    print "<tr>\n";
    print "<td>Col1</td><td>Col2</td><td>Col3</td><td>Col4</td><td>Col5</td><td>Col6</td>";
    print "</tr>\n";
    
    
    for($i=0;$i<count($files);$i++){    
        $file=$files[$i];
        $is_xls = stripos($file, ".xls");
        $is_xlsx = stripos($file, ".xlsx");
        if($is_xls === False && $is_xlsx === False)
            continue;
        $pathToFile =  $dir."/".$file;
        $inputFileType = PHPExcel_IOFactory::identify($pathToFile);
        $objReader = PHPExcel_IOFactory::createReader($inputFileType);
        $objReader->setReadDataOnly(true);
        $objPHPExcel = $objReader->load($pathToFile);
        $objWorksheet = $objPHPExcel->getActiveSheet();
        // Get the highest row and column numbers referenced in the worksheet
        $highestRow = $objWorksheet->getHighestRow(); 
        $highestColumn = $objWorksheet->getHighestColumn(); 
        $highestColumnIndex = PHPExcel_Cell::columnIndexFromString($highestColumn);
        
        $query = str_replace('-', '', preg_replace('/\s+/', '', strtolower(trim($_POST['query']))));
        
        for($row=1;$row<$highestRow;$row++){
        
            $firstColumn = $objWorksheet->getCellByColumnAndRow(0, $row)->getValue();
            $name = str_replace('-', '', preg_replace('/\s+/', '', strtolower(trim($firstColumn))));
            $pos = strpos($name, $query);
            if( $pos == True || $pos === 0){
                $found=True;
                print "<tr>\n";
                print "<td>".$objWorksheet->getCellByColumnAndRow(0, $row)->getValue()."</td>
                <td>".$objWorksheet->getCellByColumnAndRow(1, $row)->getValue()."</td>
                <td>".$objWorksheet->getCellByColumnAndRow(2, $row)->getValue()."</td>
                <td>".$objWorksheet->getCellByColumnAndRow(3, $row)->getValue()."</td>
                <td>".$objWorksheet->getCellByColumnAndRow(4, $row)->getValue()."</td>
                <td>".$objWorksheet->getCellByColumnAndRow(5, $row)->getValue()."</td>\n";
                print "</tr>\n";
            }
        }
        
        $objPHPExcel->disconnectWorksheets();
        unset($objReader);
        unset($objPHPExcel);
        unset($objWorksheet);
    }
    if(!$found){
        print "<tr>\n";
        print "<td>Nothing found.</td>\n";
        print "</tr>\n";
    }
    print "</table>\n";
} else
# If "search" parameter doesn't exist, show an error and exit 
{
    header("HTTP/1.1 500 Internal Server Error");
print "<h1>500 Internal Server Error</h1>".
      "Request is not set";
exit;
}
My software:
PHPExcel 1.8.0
PHP 5.6.0
Apache 2.4.10
Windows 7
Coordinator
Sep 22, 2014 at 10:34 PM
Suggestion would be to use read filters, and do each file in 2 passes: Use a first read filter just to read column A, as that's what you're searching on; do your search and build an array of the rows where you find your matches.... then (only if you actually find matches in that file) use the second pass to read the rows that you want, using a second read filter just to retrieve those rows,

But if you're looking to get this down to 1 second per file, then I'd suggest paying for a license for the commercial libXl and using Ilia's wrapper extension to access that