Jul 9, 2008 at 9:02 PM
Edited Jul 9, 2008 at 9:04 PM
As much as I like PHPExcel, creating Excel files has still been giving me headaches over the last few weeks. I’ve figured out a solution and I want to see if we can incorporate it into PHPExcel.
My current PHPExcel related project involves outputting database contents to an Excel file.
A typical output file consists of a total of 134 columns distributed over multiple worksheets and each worksheet having up to 1500 rows leading 210.000 cells in total.
My current issue is with speed. For a medium file, say 250 rows and up PHP Excecution time surpasses 60 seconds. Upgrading it to 300 seconds both in php.ini
and apache’s httpd.conf extends capabilities but merely postpones the problem. Moreso… waiting users are complaining users and there is no way I can get away with this. I’ve been reading about more people running into this so I’ve been doing some research
into solving this.
Now, before I go into it I just want to make clear that I understand it’s all a matter of what software is designed for. So, as much as do have some issues
in my particular usage of the library I don’t think of it less brilliant. I love the way it is set up, especially the API and the flexibility. Personally I mostly use it as an outputting engine but do consider the Writer classes as just one aspect of a wide
functionality. And, as usual, if you zoom in on one thing observations may appear out of context but I do want to make sure you understand that it’s not my intent.
Anyway: here’s what I’ve found and I assume most of this will be familiar to you:
1. PHPExcel stores the spreadsheet in memory twice.
Ofcourse, the PHPExcel object only needs to exist once for a single workbook, but as soon as I save the workbook the Writer class constructs the actual xml
sheet files (xl/worksheets/sheet1.xml, etc) inside a string before adding them to the zip file. To me this is clearly beyond obvious because we need to construct the file, but it does mean that a second representation of the spreadsheet is created when we
2. Multiple cell parses
If we populate and save a 100.000 cell worksheet we parse all the data while copying it from the database into the worksheet (creating cells, styles, etc.).
Then, if we save it, we parse it again from the Writer classes.
3. Shared strings
If we populate the database with many similar strings we create many copies of the same data in memory. The Writer classes consolidate the data into a shared
string table. However, in the meantime we do use a lot of memory.
These three items make perfect sense to me. Especially if the intent is to have an in memory spreadsheet with capabilities of modification and a rich feature
set. However, if we look purely and solely from the point of view in which we want to output data from a database to an excel file these steps aren’t necessarily needed, in fact may be slowing down the process and I’m thinking of finding a way to have this
rich and flexible feature set that is in PHPExcel now, while maintaining a very high performance level if we just want to use it to save data, to effectively use it as a conversion method, really.
Here’s what I tried:
I created a completely different lite version of PHPExcel from scratch. It stores cell data similar to PHPExcel with PHPExcel_Cell classes. Styles are not
it yet. The basic interface is similar to PHPExcel, yet very limited for now.
While constructing the sheets, after writing all the cell information to <sheetData> but before finishing that section, I created an indefinite loop
structure that triggers an OnTraverse() event for every iteration and then OnCreateData() for every sheet individually.
In the OnTraverse() event the calling application may forward the pointer to a datasource, as in ‘go to the next record’. In OnCreateData the calling application
returns a set of data that represents one single row in the spreadsheet. The Writer saves it into the constructed worksheet xml files immediately, bypassing the creation of cell and style classes. All content is passes through a sharedStrings filter first
and as such compressed.
Result: Writing 3.000 rows of 56 columns (168.000 cells) consumed 23Mb peak memory and 7 seconds. That is but a fraction of PHPExcel which required almost
5 minutes and 165Mb peak memory.
So, I guess the question is: can we find a way to incorporate the best of both worlds (I hope so)… or should PHPExcel have a sister library that is primarily
focused on outputting data fast and quickly (I hope not).
I will email you the source code of what I did but I’m posting this here anyway, so that anyone can join in the thinking.
Let me know what you think. There is so much more to this than just an idea. I think it also comes down to how you want to position PHPExcel.