Problems loading and accessing an Excel file

Topics: Developer Forum
Nov 29, 2012 at 12:11 PM

First of all, I have to say that I've read a lot of posts about this problem, but I don't think I'm getting the point of the techniques for loading big files.

I have an Excel file of about 65,000 rows and 164 columns (EX is the last column), which gives about 10,660,000 cells.

I'm trying to load it to read its values (reading only, no writing) and I've tried a lot of alternatives, but none of them has worked (sometimes because of the memory limit and sometimes because of the execution time limit).

I'm running a local WAMP server with memory_limit = 640M and max_execution_time = 3000.
I've tried the following techniques:

  1. Basic loading: problems with memory usage.
  2. Basic loading with cache methods (disk, sqlite3, ...): problems with memory usage.
  3. Chunked loading (with a ReadFilter, sketched below): problems with the execution time limit.

I think I'm not understanding the chunked loading method, because I've tried loading only 20 or 50 rows and it takes the same amount of time... which makes sense to me, because it is a single load task (I wrote this attempt outside of a loop).

I've tried to load 1000 rows with this technique and the server goes down after two intensive bursts of memory usage (I've seen that in the performance panel of the Windows Task Manager).

I need to know if my file is too big for the PHPExcel library, and whether there is a specific Excel document format that works better in terms of performance... all the info you can give me is much appreciated... :(

I'm also using other optimization techniques, like:

$objReader->setReadDataOnly(true);         // ignore formatting and read only the cell values
$objReader->setLoadSheetsOnly(SHEET_NAME); // load just the one worksheet I need
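My chunked read attempt follows the ReadFilter pattern from the PHPExcel documentation, roughly like this (the reader type, file name and chunk boundaries here are placeholders for my real values):

/** Read filter that only accepts cells inside a given window of rows. */
class ChunkReadFilter implements PHPExcel_Reader_IReadFilter
{
    private $startRow;
    private $endRow;

    public function __construct($startRow, $chunkSize)
    {
        $this->startRow = $startRow;
        $this->endRow   = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '')
    {
        // Always read row 1 (headings), plus the rows of the current chunk
        return ($row == 1) || ($row >= $this->startRow && $row < $this->endRow);
    }
}

$objReader = PHPExcel_IOFactory::createReader('Excel2007');
$objReader->setReadDataOnly(true);
$objReader->setLoadSheetsOnly(SHEET_NAME);
$objReader->setReadFilter(new ChunkReadFilter(2, 1000)); // read rows 2..1001
$objPHPExcel = $objReader->load('bigfile.xlsx');
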
Thanks in advance,
Marcos. 

Nov 29, 2012 at 1:02 PM

Hi,

I've got some experience with large Excel sheets using the PHPExcel library, but rather than reading, I used PHPExcel for writing. The biggest files I worked with had around 30 columns and 150K rows, giving less than half the number of cells you have. With those kinds of numbers I was really pushing the limits of my hardware:

2x quad-core Xeon + 24 GB of memory (memory being the main bottleneck).

As you have more than twice the number of cells I did, I fear you are trying to push PHPExcel a little too far, but perhaps Mark has some tips for you!

(I eventually ended up writing my own xlsx writer as the datasets became too big. It streams data into the xlsx file, so memory is no longer an issue.)

Nov 29, 2012 at 1:34 PM

Hi borft,

Thanks a lot for your numbers, they are really interesting.

Finally I'm able to deal with this "little monster" :)
The key was to really understand how PHPExcel works when loading files and working with chunks. I noticed that the same Excel file takes more or less the same time to load, regardless of the chunk size that will be applied later to process the data. The memory limit and the chunk size are important when working with the data (after loading is complete), to allow the filters to read in chunks and do the work without errors.

At the moment, my findings are the following:

Common environment

  • max_execution_time = 600 (php.ini)
  • max_input_time = 600 (php.ini)
  • memory_limit = 512M (php.ini)
  • Excel file of 52 MB, with 65,000 rows and 164 columns (10,660,000 cells)
  • Common loading time: about 170 s

Test A. With cache_to_sqlite3 method

  • 1000 rows / +4 s approx.
  • 2000 rows / +18 s approx.
  • 4000 rows / +50 s approx.
  • 5000 rows / +67 s approx.
  • 6000 rows / +88 s approx.

Test B. With cache_in_memory_serialized method

  • 6000 rows / +52 s approx. (about a 40% speed improvement over cache_to_sqlite3)
  • 10000 rows / +70 s approx.

There is no significant difference in speed between using cache_in_memory_serialized and not using a cache method at all, but if you don't use a cache method you get a memory limit error.
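For reference, this is how the cache method is selected in PHPExcel; it has to be set before the workbook is loaded (cache_to_sqlite3 works the same way, just with a different constant):

$cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_in_memory_serialized;
PHPExcel_Settings::setCacheStorageMethod($cacheMethod);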

So this way I can set up a session-based process that does the work step by step across page reloads (not in a loop), and get the file fully processed in about 24 minutes (6 steps of 10,000 rows each). A rough sketch of the idea is below.
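This is only a sketch, not final code: $inputFileName and the processRows() helper are placeholders, and ChunkReadFilter is the row-window read filter from my first post.

session_start();

$chunkSize = 10000;
// Each page load resumes where the previous one stopped
$startRow = isset($_SESSION['startRow']) ? $_SESSION['startRow'] : 2;

$objReader = PHPExcel_IOFactory::createReader('Excel2007');
$objReader->setReadDataOnly(true);
$objReader->setLoadSheetsOnly(SHEET_NAME);
$objReader->setReadFilter(new ChunkReadFilter($startRow, $chunkSize));
$objPHPExcel = $objReader->load($inputFileName);

processRows($objPHPExcel->getActiveSheet()); // placeholder for the real per-chunk work

// Free memory before the next step
$objPHPExcel->disconnectWorksheets();
unset($objPHPExcel);

$_SESSION['startRow'] = $startRow + $chunkSize; // the next reload continues here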

When I complete my tests (I'm going to reduce the memory settings), I'll post the final code here.

Thanks a lot,
Marcos.