Handling very large files

Feb 8, 2010 at 6:32 PM

Hello,

After some searching around on the internet, I came across PHPExcel some time back for importing spreadsheets into our application. It worked very well for the small 10 to 200 row test files we created. I was impressed with the 2005 support, etc., and the interface is very clean, but one of the things that worried me was what would happen when we dealt with larger files. Now that we are testing, we have run into the problem below; any thoughts would be valued.

Problem

We have a 5MB file with 50,000 rows; the sample data was grabbed from: http://www.fakenamegenerator.com/order.php

 

<?php

require_once 'PHPExcel/IOFactory.php';

// Load the whole CSV into a PHPExcel object, then report peak memory usage
$objPHPExcel = PHPExcel_IOFactory::load('../db_names.csv');
var_dump(memory_get_peak_usage());

After roughly 3 minutes it reports a peak of about 900MB. This is on a 2.3GHz single-core Xeon VM.

 

Thoughts

We assume much of the reason the entire file is loaded into memory is for calculation of formulas, etc. We are already planning on writing a simplified CSV parser that inserts into our database as it reads input, so nothing sits in memory. We are also planning on offloading these jobs to some of our worker queues. Our prod machines are much more powerful, roughly 64GB of RAM and 16 processor cores; however, given the nature of PHP, I am not sure vertical scaling will be practical. A 15GB PHP process, is that even possible? Sounds kinda gross :P

I would like to know if a method exists to read an Excel spreadsheet row by row, without formula calculations etc., so we can insert into a database as we read each row instead of loading everything into memory. Ideally I would like to read chunks of the file and pass them to the PHPExcel library. I understand if this is not possible, and I am okay with loading a 100MB string into memory; based on my findings, though, I cannot afford to then transform that string into the rather complicated object representation PHPExcel constructs, which is many times larger.

 

Thank you for your time,

-Chris

 

Coordinator
Feb 9, 2010 at 8:44 AM

We're very aware of the memory usage of PHPExcel, and are working hard on reducing that memory footprint with every release (while still trying to add the new functionality that users are asking for); in the longer term, we're looking at caching the worksheet data so that it isn't all resident in memory. It's already possible to read in selected worksheets rather than the entire workbook using the reader's setLoadSheetsOnly() method, and to apply a filter using the reader's setReadFilter() method to restrict the data to a range of cells. Furthermore, with the reader's setReadDataOnly(true) method, you can instruct PHPExcel to read in only the cell data rather than all the formatting as well: it's often this formatting information that increases the memory requirements of reading a workbook.
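As a rough sketch of combining those reader options (the file and sheet names here are just placeholders):

<?php
require_once 'PHPExcel/IOFactory.php';

// Read filter restricting the load to the first 1,000 rows
class FirstThousandRowsFilter implements PHPExcel_Reader_IReadFilter
{
    public function readCell($column, $row, $worksheetName = '')
    {
        return $row <= 1000;
    }
}

$reader = PHPExcel_IOFactory::createReader('Excel2007');
$reader->setReadDataOnly(true);              // cell data only, no formatting
$reader->setLoadSheetsOnly('Data');          // load just the one worksheet
$reader->setReadFilter(new FirstThousandRowsFilter());

$objPHPExcel = $reader->load('big_workbook.xlsx');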

What we can't do with PHPExcel is simply read the data into a PHP array: we read it into a series of objects because that's how we provide all the functionality of the library. Your assumption that it's the calculation engine that requires holding the entire file in memory is incorrect: it's so that the workbook can be manipulated/formatted/edited/etc. before writing. PHPExcel is primarily a tool for writing Excel, with the readers there to allow users to generate templates that can then be populated from within their code before writing. We do provide the toArray() method to transform a worksheet into a simple PHP array.

If you're working only with CSV, then PHP's fgetcsv() function is the simplest answer. If you need something similar for Excel, there are plenty of alternative Excel readers that do just that, including the MS-Excel Stream Handler class at http://www.phpclasses.org/browse/package/1919.html, which reads an Excel workbook (Excel 5) as a file stream (which certainly sounds more like what you're asking for). I'm not aware of anything comparable to the stream handler class for Excel 2007, although it shouldn't be too difficult to write.
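For the CSV case specifically, a minimal fgetcsv() sketch along the lines of what you describe (the DSN, table and column names are placeholders):

<?php
// Read the CSV one row at a time and insert as we go, so only the current
// row is ever held in memory.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO names (first_name, last_name, email) VALUES (?, ?, ?)');

$handle = fopen('../db_names.csv', 'r');
fgetcsv($handle); // skip the header row, if present
while (($row = fgetcsv($handle)) !== false) {
    $stmt->execute(array($row[0], $row[1], $row[2]));
}
fclose($handle);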

Feb 10, 2010 at 8:05 PM

Hello Mark,

Thank you for your response. I am pleased to hear I was off about the formula calculations, and am glad there are some workarounds where memory is concerned. I am thankful for the time you spent pointing me to other resources; if need be I will look into them, but I am very interested in how I may use PHPExcel to fit my needs.

I have looked into the code and familiarized myself a bit with it. I am interested in your thoughts on setting some kind of write context for each row, instead of sending rows to a worksheet where they accumulate into an eventually memory-consuming object representation. A write context would just be a class which places each row into the appropriate place, whether that is pushing it out through some IO or tacking it onto an attached worksheet instance like it does now. Before I dive into something like this, do you have any thoughts or quick reasons why this would not be possible?
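Roughly, this is the sort of thing I picture (names are purely illustrative, nothing that exists in PHPExcel today):

<?php
// Illustrative only: a "write context" hands each row off immediately rather
// than keeping it in a worksheet object. Table and column names are placeholders.
interface RowWriteContext
{
    public function writeRow($rowIndex, array $cellValues);
}

class DatabaseRowWriteContext implements RowWriteContext
{
    private $insert;

    public function __construct(PDO $pdo)
    {
        $this->insert = $pdo->prepare(
            'INSERT INTO imported_rows (row_index, payload) VALUES (?, ?)'
        );
    }

    public function writeRow($rowIndex, array $cellValues)
    {
        // Dispose of the row straight away instead of accumulating it in memory
        $this->insert->execute(array($rowIndex, json_encode($cellValues)));
    }
}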

Thank you for your time,

-Chris

Coordinator
Feb 10, 2010 at 10:59 PM
Edited Feb 10, 2010 at 11:00 PM
cstockton wrote:

Thank you for your response. I am pleased to hear I was off about the formula calculations, and am glad there are some workarounds where memory is concerned. I am thankful for the time you spent pointing me to other resources; if need be I will look into them, but I am very interested in how I may use PHPExcel to fit my needs.

In addition to the general methods that I identified for using PHPExcel settings to reduce memory usage for stored cells by restricting the data loaded, the latest SVN code contains a "lazy loader" that only loads the classes that PHPExcel requires when they are actually needed, so if you aren't using the PDF writer (for example), none of the PDF writer code will ever be "required" into memory. This reduces the memory footprint of the script itself; and I shall be splitting the functions.php script used by the calculation engine so that if none of the formulae in a workbook use (for example) Financial functions, or Statistical functions from the Analysis ToolPak, they won't be "required" into memory either.

Currently, memory savings from the "lazy loader" average between 0.5 and 2.5MB in the actual memory requirement for the script itself, allowing more cells to be read from an Excel file before hitting any memory limit. Our current estimate is that each cell object uses an average of about 1KB of memory, so this change would allow workbooks with between 500 and 2,500 more cells before hitting memory limits.
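For anyone unfamiliar with the technique, a generic sketch of class autoloading, the mechanism a lazy loader of this kind is built on, looks something like this (the path layout and names are illustrative, not PHPExcel's actual autoloader):

<?php
// Register a callback that require's a class file only the first time the
// class is actually used, so unused code never occupies memory.
function lazyClassLoader($className)
{
    $file = str_replace('_', '/', $className) . '.php';
    if (is_readable($file)) {
        require_once $file;
    }
}
spl_autoload_register('lazyClassLoader');

// At this point no writer code has been loaded; a PDF writer class would only
// be pulled into memory if something actually instantiated or referenced it.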

cstockton wrote:

I have looked into the code and familiarized myself a bit with it. I am interested in your thoughts on setting some kind of write context for each row, instead of sending rows to a worksheet where they accumulate into an eventually memory-consuming object representation. A write context would just be a class which places each row into the appropriate place, whether that is pushing it out through some IO or tacking it onto an attached worksheet instance like it does now. Before I dive into something like this, do you have any thoughts or quick reasons why this would not be possible?

I know that Erik has some ideas of his own for cell data caching that he's been looking at for some time now, so he's probably the best person to talk to about this.

Independently, I've been looking at SPL (and cursing the lack of documentation about it in the PHP manual), specifically the CachingIterator used over either SplObjectStorage or SplDoublyLinkedList. This could not only allow cell data to be "memory resident" only when it is actually being referenced within the code, but also allow me to implement SplObservers to handle automated recalculation of affected formulae when a cell value is changed. However, I haven't been able to identify how SPL caching actually caches the data: whether it uses /tmp filespace, or whether it can be configured to use APC memory, memcache, indexed database tables, or other storage. I'll probably end up pulling the source to work it out.
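Purely by way of illustration, the SPL classes in question look like this in use (this is just the raw classes, not anything PHPExcel does today):

<?php
// A doubly linked list of cell values, wrapped in a CachingIterator.
$cells = new SplDoublyLinkedList();
$cells->push('A1 value');
$cells->push('A2 value');
$cells->push('A3 value');

// CachingIterator reads one element ahead of the wrapped iterator,
// which is what makes hasNext() possible.
$iterator = new CachingIterator($cells);
foreach ($iterator as $value) {
    echo $value;
    echo $iterator->hasNext() ? ', ' : PHP_EOL;
}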

However we end up implementing it, though, we don't want to make memory savings at a massive cost in performance.