Proper way to have PHPExcel set only for 'reading'

Topics: Developer Forum
Feb 6, 2011 at 4:22 AM

I have been going through a number of the forum entries/discussions here and I'm still a bit confused, so I ran a few benchmarks myself to get a feel for the library.

What I am trying to do is mass import 100-100,000 rows from XLS (capped, of course, at 65,536 rows) or XLSX files. I need to read the first row (the column names) in order to work out which data columns are needed.
After that, I would take only the columns I need; the import file may contain other columns that are irrelevant (this is for importing, say, purchase order information).

The required columns would then be imported into my database (via iteration of some sort). Currently I have tried the following code:

$inputFileName = "/var/www/default/uploads/".$file;

/**  Identify the type of $inputFileName  **/
$inputFileType = PHPExcel_IOFactory::identify($inputFileName);

/**  Create a new Reader of the type defined in $inputFileType  **/ 
$reader = PHPExcel_IOFactory::createReader($inputFileType); 

/**  Define how many rows we want to read for each "chunk"  **/ 
$chunkSize = 2048; 
/**  Create a new Instance of our Read Filter  **/ 
$chunkFilter = new chunkReadFilter(); 

/**  Tell the Reader that we want to use the Read Filter  **/ 
$reader->setReadFilter($chunkFilter); 
$reader->setReadDataOnly(true);

echo "<pre>";

/**  Loop to read our worksheet in "chunk size" blocks  **/ 
for ($startRow = 2; $startRow <= 65536; $startRow += $chunkSize) { 
	/**  Tell the Read Filter which rows we want this iteration  **/ 
	$chunkFilter->setRows($startRow, $chunkSize); 
	/**  Load only the rows that match our filter  **/ 
	$objPHPExcel = $reader->load($inputFileName); 

	//    Do some processing here 
	$rows = $objPHPExcel->getActiveSheet()->toArray();
	print_r($rows);

	//    Free up some of the memory 
	$objPHPExcel->disconnectWorksheets(); 
	unset($objPHPExcel);

	//    Stop once we've read a partial (i.e. final) chunk
	if (count($rows) < $chunkSize) {
		break;
	}
} 			

echo "</pre>";				
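For reference, the chunkReadFilter class isn't shown in my post; I'm assuming something along the lines of the version from the PHPExcel documentation:

```php
/**  Read filter that allows the heading row plus the rows in the current "chunk"  **/
class chunkReadFilter implements PHPExcel_Reader_IReadFilter
{
	private $_startRow = 0;
	private $_endRow   = 0;

	/**  Set the range of rows that we want to read for this chunk  **/
	public function setRows($startRow, $chunkSize)
	{
		$this->_startRow = $startRow;
		$this->_endRow   = $startRow + $chunkSize;
	}

	public function readCell($column, $row, $worksheetName = '')
	{
		//  Always read row 1 (the column headings), plus the configured chunk
		if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {
			return true;
		}
		return false;
	}
}
```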

My processing would most likely involve building up an SQL insert query and inserting every 100 records or so.
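As a rough sketch of that batching idea (the connection details, the insertBatch() helper, and the "orders" table with its "po_number" and "amount" columns are all invented for illustration), PDO prepared statements would do it:

```php
/**  Hypothetical batched insert: flush to the database every 100 rows  **/
$pdo   = new PDO($dsn, $user, $password);   // connection details assumed
$batch = array();

foreach ($rows as $row) {
	$batch[] = $row;
	if (count($batch) >= 100) {
		insertBatch($pdo, $batch);
		$batch = array();
	}
}
if (!empty($batch)) {
	insertBatch($pdo, $batch);   // flush any remaining rows
}

function insertBatch($pdo, $batch)
{
	//  Build one multi-row INSERT with placeholders for every row in the batch
	$placeholders = implode(',', array_fill(0, count($batch), '(?, ?)'));
	$stmt = $pdo->prepare(
		"INSERT INTO orders (po_number, amount) VALUES $placeholders"
	);
	$params = array();
	foreach ($batch as $row) {
		$params[] = $row[0];
		$params[] = $row[1];
	}
	$stmt->execute($params);
}
```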

I am finding that no matter what chunk size I use, large files usually time out (300+ sec) or run out of memory, even though I unset the $objPHPExcel object.
As you probably can tell from my copy-and-paste code, I am very new to PHPExcel. 

Coordinator
Feb 6, 2011 at 10:49 AM
Edited Feb 6, 2011 at 10:55 AM

Assuming you're using the chunkReadFilter() that I posted here a few weeks ago: you're still reading every worksheet, and the chunkReadFilter() will read the same chunk from each worksheet in the workbook. Set your chunk reader to read only from one worksheet at a time, or use $objReader->setLoadSheetsOnly($sheetname) to force the reader to work with a single, named worksheet at a time.
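For example (a sketch, assuming the reader, filter, and chunk size from the original post; the sheet name is just an example):

```php
/**  Restrict the reader to a single, named worksheet before loading each chunk  **/
$reader->setLoadSheetsOnly('Data Sheet #1');
$reader->setReadFilter($chunkFilter);

for ($startRow = 2; $startRow <= 65536; $startRow += $chunkSize) {
	$chunkFilter->setRows($startRow, $chunkSize);
	$objPHPExcel = $reader->load($inputFileName);
	//  ... process only this sheet's chunk of rows ...
	$objPHPExcel->disconnectWorksheets();
	unset($objPHPExcel);
}
```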

I've added a listWorksheetNames() method to the Readers for the next release that will allow you to get a list of all the sheet names without reading the rest of the file. You could call that before your main loop, and then chunk through each of those worksheets in turn.
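Once that method is available, the outer loop might look something like this (a sketch, assuming the new API and the variables from the original post):

```php
/**  Get the sheet names up front, then chunk through each sheet in turn  **/
$worksheetNames = $reader->listWorksheetNames($inputFileName);

foreach ($worksheetNames as $sheetName) {
	$reader->setLoadSheetsOnly($sheetName);
	for ($startRow = 2; $startRow <= 65536; $startRow += $chunkSize) {
		$chunkFilter->setRows($startRow, $chunkSize);
		$objPHPExcel = $reader->load($inputFileName);
		//  ... process this chunk of $sheetName ...
		$objPHPExcel->disconnectWorksheets();
		unset($objPHPExcel);
	}
}
```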

Be careful using $reader->setReadDataOnly(true): you may have problems identifying dates in the worksheet when you come to actually process the cell data, because Excel stores dates as numbers and without the formatting information you won't be able to differentiate between a date and a plain number.
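When formatting information is loaded (i.e. without setReadDataOnly(true)), dates can be identified and converted along these lines (a sketch; the cell reference is just an example):

```php
/**  Identify and convert an Excel date cell (requires formatting to be loaded)  **/
$cell = $objPHPExcel->getActiveSheet()->getCell('B2');

if (PHPExcel_Shared_Date::isDateTime($cell)) {
	//  Convert the Excel serial date value to a PHP timestamp
	$timestamp = PHPExcel_Shared_Date::ExcelToPHP($cell->getValue());
	echo date('Y-m-d', $timestamp);
} else {
	echo $cell->getValue();
}
```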


$objPHPExcel->getActiveSheet()->toArray() is a slow and memory-expensive way to check what data has been loaded. Again, with the latest SVN code I've reduced the memory overhead of toArray() and made it faster, and I've added an additional rangeToArray() method that allows you to specify a range of the worksheet to return as an array, e.g. 'A'.$startRow.':Z'.($startRow+$chunkSize)


Chunking is a lot slower than reading the entire workbook just once, because the file has to be parsed on every iteration of the loop. It's basically a trade-off between speed and memory usage. It isn't designed for "online" work, but for background processing where you can increase the timeout value; 300 seconds is likely to be the web server timeout.