Having trouble creating large Excel files (timeouts and memory limits)

Topics: Developer Forum
Oct 19, 2010 at 10:56 PM

I had a script using PEAR's Excel package that was running into problems with large amounts of data across several worksheets.

The temporary solution was to only allow the user to export one worksheet at a time (which works with up to around 130000 rows across multiple worksheets, but bombs with another 100000). In the meantime I tried converting the code to PHPExcel to see if it could handle the large amount of data better.

I'm using PHPExcel 1.7.4 (PHP 5.2.8)

At the moment I seem to be having more trouble with PHPExcel just getting one of the worksheets going, let alone multiple ones.

The data is an export from a MySQL database - the table sizes I'm exporting vary from 40000 to 150000 rows.

 

Below is the function, modified from the PEAR version:

 

ini_set ( 'memory_limit', '2500M' );
include_once 'ccma_bores/classes/PHPExcel.php';

$excel= new PHPExcel();

$excel->getProperties()->setTitle("CCMA Bore Export ".date("d-m-Y H:i"));

$sheet_number= 1;
........

function build_worksheet($row_array, $sheet_name, $repeat_number = 1, $column_names = "") {
    global $excel;
    global $sheet_number;

    //echo "Sheet number: {$sheet_number}<br>";
    $excel->createSheet($sheet_number);
    $excel->setActiveSheetIndex($sheet_number);
    //echo "Sheet name: {$sheet_name}<br>";
    $excel->getActiveSheet()->setTitle(ucwords($sheet_name));
    $sheet_number++;

    set_time_limit(1000);

    if (count($row_array) == 0) {
        $excel->getActiveSheet()->setCellValue('A1', "There is no data for this worksheet");
        return;
    }

    $row_number = 1;
    $x = 0;
    $first_row = $row_array[0];

    if (is_array($column_names)) {
        //Order by preferred order with correct labels
        foreach ($column_names as $field_name => $column_name) {
            $excel->getActiveSheet()->setCellValueByColumnAndRow($x, $row_number, $column_name);
            $x++;
        }
    } else {
        foreach ($first_row as $field => $value) {
            $excel->getActiveSheet()->setCellValueByColumnAndRow($x, $row_number, $field);
            $x++;
        }
    }

    $row_number = 2;

    //$test_array = array(array("row1","row2","row3"),array("b1","b2","b3")); //Can get fromArray function to work with small arrays
    //$excel->getActiveSheet()->fromArray($row_array,"","A2"); //Cannot get fromArray function to work with large arrays

    foreach ($row_array as $row) {
        $x = 0;
        $row_values = array();

        if (is_array($column_names)) {
            //Order by preferred order with correct labels
            foreach ($column_names as $field_name => $column_name) {
                $value = $row[$field_name];
                $row_values[$column_name] = $value;
                $excel->getActiveSheet()->setCellValueByColumnAndRow($x, $row_number, $value);
                $x++;
            }
        } else {
            foreach ($row as $field => $value) {
                $row_values[$field] = $value;
                $excel->getActiveSheet()->setCellValueByColumnAndRow($x, $row_number, $value);
                $x++;
            }
        }

        //$excel->getActiveSheet()->fromArray($row_values,"","A{$row_number}"); //Cannot get fromArray function to work with large arrays

        $row_number++;

        if ($row_number >= 10000) { //Testing limit before timeouts
            return;
        }

        /*if ($row_number == 65536) { //Had to disable large worksheet splitting as program was timing out
            $row_array = array_slice($row_array, $row_number - 1, true); //Start a new worksheet if maximum row value reached
            build_worksheet($row_array, $sheet_name, $repeat_number + 1, $column_names);
            return;
        }*/
    }
}

.........

build_worksheet($bores, 'bores', "", $BORE_COLUMN_NAMES);

header('Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet');
header('Content-Disposition: attachment;filename="bore-export-' . date('d-m-y') . '.xlsx"');
header('Cache-Control: max-age=0');

$objWriter = PHPExcel_IOFactory::createWriter($excel, 'Excel2007');
$objWriter->save('php://output');

 

 


$row_array is basically just an array of associative MySQL query rows.

I've had to put some manual limits on the number of rows - I haven't been able to make it work successfully with more than 10000 rows, as the setCell function starts taking too long. (It seems to get exponentially slower: if I go to 40000 it bombs without error after a long time.)

I've had no luck using the fromArray() function - it seems to produce no results when the array is large.

There is a function to split the worksheet when it exceeds Excel's maximum rows, but I've had to disable this as the script can't get anywhere near that limit.

What can I do to reduce memory usage and execution time so the setCell function works with large tables?

Are there any other ways to create large spreadsheets (e.g. compiling one workbook at a time and stitching them together in different scripts)?

Coordinator
Oct 22, 2010 at 9:18 AM

You'd certainly get better memory usage by building the worksheet one row at a time as you fetch the rows from your database, rather than loading all the database rows into an array and then building the worksheet from that, which means you have both the database array and the worksheet fully memory resident at the same time.
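A rough sketch of that approach (assuming a mysql result resource $result and the same global $excel used in your build_worksheet(); the names here are illustrative, not from this thread):

$sheet = $excel->getActiveSheet();
$row_number = 2;                                        // row 1 holds the column headings
while ($row = mysql_fetch_assoc($result)) {             // fetch one database row at a time
    $x = 0;
    foreach ($row as $field => $value) {
        $sheet->setCellValueByColumnAndRow($x, $row_number, $value);
        $x++;
    }
    $row_number++;                                      // the fetched row can now be discarded
}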

Oct 26, 2010 at 2:51 AM

Unfortunately that's not the problem - but I tried it as a test anyway and noticed no difference.

 

Row 100 4.99020791053772 secs
Row 100 3.546891927719116 secs
Row 1000 9.283177137374878 secs
Row 5000 35.76950788497925 secs
Row 10000 68.98192405700684 secs
Peak memory usage: 712.5 MB

Beyond that it bombs without error - maybe it's just browser timeouts. I'll try running it from the server.

 

Oct 26, 2010 at 3:23 AM
Edited Oct 26, 2010 at 3:24 AM

Hmm - same results from the server - the function just bombs without any PHP error when there's more than about 15000 rows. (The use of the copied array adds very little overhead in the scheme of things.)

It seems to use a lot more memory than the PEAR Excel writer - so at the moment I can't even build one large worksheet, let alone the several worksheets I need.

[26-Oct-2010 13:50:44] CCMA BORES Row 500 5.823396921157837 secs

[26-Oct-2010 13:50:44] CCMA BORES Peak memory usage: 54 MB  Export Saved Successfully

 

 

[26-Oct-2010 13:52:02] CCMA BORES Row 5000 34.06914901733398 secs

[26-Oct-2010 13:52:02] CCMA BORES Peak memory usage: 365.75 MB Export Saved Successfully

 

 

 

[26-Oct-2010 13:54:52] CCMA BORES Row 15000 97.50964403152466 secs

[26-Oct-2010 13:54:52] CCMA BORES Peak memory usage: 1073.25 MB Export Saved Successfully

 


[26-Oct-2010 14:08:36] CCMA BORES Row 30000 195.9688181877136 secs

[26-Oct-2010 14:08:36] CCMA BORES Peak memory usage: 2127.5 MB

 

On those numbers, with 3 GB of RAM in the system I can't build a single worksheet with 45000 rows, let alone multiple worksheets.



Is there a more efficient method I can use apart from setCellValueByColumnAndRow?

What is the largest worksheet size that is possible / has been done before?

Is there a way to build the spreadsheet part at a time so I can schedule 5000 rows at a time in different scripts?

 

Example rows:

Bore ID	Parish	Parish No	Longitude (GDA94)	Latitude (GDA94)	Easting (MGA)	Northing (MGA)	Map zone	Elevation (m AHD)	Elevation (digitised)	Constructed date	Constructed depth	Authority	Authority Code	Bore type	Use 1	Use 2	Use 3	Monitoring Status	Monitoring Frequency	Initial SWL (m)	Initial EC (mS/cm)	Initial TDS (mg/L)	Driller	Drill rig	Drilling Method	Bore licence no	Headworks / upstand	Headworks height (m)	Collar	Site description	Land use	Data source	Old Bore ID	Site photo	Photo details	Site access	Bore access	Landholder name	Landholder contact	Other source	Comments	Local bore name	Location Accuracy
100001	WANGOOM	3729	142.48619	-38.318	629921.43	5757856.42	54			1991-08-06 00:00:00	42.67	Not Known		Groundwater	Domestic			N							Down-Hole Hammer Percussion			0				GMS Mar10	15135								parish doesn't match, but located on boundary of correct parish.		
100002	WANGOOM	3729	142.509182	-38.356636	631861.44	5753536.41	54			1991-06-20 00:00:00	48.5	Not Known		Groundwater	Domestic			N							Rotary Air			0				GMS Mar10	15136								Parish matches, Bore Location may be OK.		
100003	WANGOOM	3729	142.539971	-38.356777	634551.45	5753476.41	54			1991-10-07 00:00:00	54.86	Not Known		Groundwater	Domestic and Stock			N							Cable Tool, Percussion			0				GMS Mar10	15137								Parish matches, Bore Location may be OK.		


Oct 26, 2010 at 7:57 AM

Hmm, a lot more is possible, with the right hardware :P

Takes about 8 hours of querying... and a lot of memory...

20101019 07:45:25 Query complete! (1.18 seconds, 554041 rows)
20101019 07:57:06 Query complete! (27.37 seconds, 6500 rows)
20101019 14:28:39 Report... took: 54279.99 seconds
20101019 14:28:39 Peak memory usage: 26489 MB

 

However, I too would be very happy if we could trim the memory usage down. At least for xlsx, a partial solution would be to have a streaming export. That would potentially save 50% of the memory usage (as the results are, I believe, copied to the export object).

Coordinator
Oct 26, 2010 at 8:44 AM
pfeely wrote:

Hmm - same results from the server - the function just bombs without any PHP error when there's more than about 15000 rows. (The use of the copied array adds very little overhead in the scheme of things.)

It seems to use a lot more memory than the PEAR Excel writer - so at the moment I can't even build one large worksheet, let alone the several worksheets I need.

I must say I'm surprised that processing a row at a time as it's read from the database uses as much memory as building an array of data from the database before passing it to PHPExcel to process. That just seems intrinsically wrong... not least because PHP simply needs more memory to hold a large 2D array than it does a single row array.

PHPExcel does use a lot more memory than PEAR, and it isn't as fast... it does a lot more than PEAR (reading and writing, multiple formats, rich text, conditional formatting, etc). We've been striving for over a year now to reduce the memory overhead, and to increase speed, and have made significant inroads in that time... but it is still slower and more memory-hungry than PEAR.

 

pfeely wrote:

Is there a more efficient method I can use apart from setCellValueByColumnAndRow?

$excel->getActiveSheet()->fromArray($row_array,NULL,'A1'); should work, but you've already tried that... don't know why it didn't work for you, although looking over the method, I can see how to boost its speed slightly.
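If you want to experiment with it further, one option (just a suggestion, not something tested in this thread) is to feed fromArray() in smaller chunks rather than passing the whole result set at once:

foreach (array_chunk($row_array, 1000) as $i => $chunk) {
    // write each 1000-row chunk starting at the appropriate row (data begins at row 2)
    $excel->getActiveSheet()->fromArray($chunk, NULL, 'A' . (2 + $i * 1000));
}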

setCellValue() is fractionally faster than setCellValueByColumnAndRow(), but it's microseconds difference... one call to PHPExcel_Cell::stringFromColumnIndex() and a concatenation.
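For example, these two calls write the same cell; the ...ByColumnAndRow() variant just converts column index 0 to 'A' first:

$excel->getActiveSheet()->setCellValueByColumnAndRow(0, 2, $value);  // column index 0, row 2
$excel->getActiveSheet()->setCellValue('A2', $value);                // equivalent, marginally faster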

Your problem seems to be memory rather than simply speed: to conserve memory, look to using cell caching; though there's a cost in speed. My rule of thumb is 1k/cell, doubled when you intend to write the workbook, plus between 10 and 25MB for the code itself. Add 60% to each figure if you're running 64-bit PHP. If you're using cell caching (particularly php://temp or diskISAM), you can reduce that 1k/cell to about 330bytes/cell... you can reduce it even further with APC, memcache or wincache, but that requires adequate memory allocated to your cache (about 1.2k/cell).
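As a minimal sketch, enabling php://temp cell caching looks something like this (the memoryCacheSize value is only an example, and the cache method has to be set before the workbook object is created):

require_once 'ccma_bores/classes/PHPExcel.php';

$cacheMethod   = PHPExcel_CachedObjectStorageFactory::cache_to_phpTemp;
$cacheSettings = array('memoryCacheSize' => '64MB');   // cells beyond this spill to php://temp
PHPExcel_Settings::setCacheStorageMethod($cacheMethod, $cacheSettings);

$excel = new PHPExcel();   // instantiate only after the cache method is configured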

If you're working with larger workbooks, then CLI is better than web interface to prevent timeouts.

 

pfeely wrote:

What is the largest worksheet size that is possible / has been done before?

I know that both borft and I are working with large workbooks: in my case, 64k rows/worksheet for 2-3 worksheets with 100+ columns; but both of us have the available memory to throw at PHP and the time to process the workbooks.

 

pfeely wrote:

Is there a way to build the spreadsheet part at a time so I can schedule 5000 rows at a time in different scripts?

 

Not without a complete rewrite of PHPExcel that totally changes the way it works and significantly reduces the library's ability to work with different file formats, or switching to a non-PHP solution... unless you're outputting to CSV (and in that case, I'd assume you'd use PHP's standard fputcsv()).
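For reference, a streaming CSV export along those lines might look like this (reusing the $BORE_COLUMN_NAMES field => label array from your script and assuming a mysql result resource $result; both names are placeholders here):

header('Content-Type: text/csv');
header('Content-Disposition: attachment;filename="bore-export-' . date('d-m-y') . '.csv"');

$out = fopen('php://output', 'w');
fputcsv($out, array_values($BORE_COLUMN_NAMES));   // heading row
while ($row = mysql_fetch_assoc($result)) {
    fputcsv($out, $row);                           // one row written at a time, nothing buffered in PHP
}
fclose($out);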

 

Oct 26, 2010 at 10:03 PM

Thanks

 

Basically, the way we have our applications configured, MySQL query results return arrays (for ease of use, and in preparation to switch over to a mysqlnd function), so that's the reason it's the way it is. Even though we could save a little bit of memory by using other methods, it wouldn't change the end result.

 

Our server isn't bad or slow - it's just a production server with 100 websites on it - so I'm not really keen on turning the memory limit off and letting it run for 8 hours (even though it should just chug away and eat into the swap file).

 

Anyway - thanks for your help - I might need to go back to a CSV for this one.