Processing iso8859-1 input in quite large spreadsheet

Topics: User Forum
Nov 29, 2013 at 6:48 PM
Edited Nov 29, 2013 at 6:58 PM
I export some data into a quite large spreadsheet. It consist of 2 sheets. The first sheet has roughly 2000 rows and 30 columns and the second sheet has roughly 3000 lines and 70 columns.

The problem is that the input encoding is iso-8859-1 so I have to do for each field which is string mb_detect_encoding and mb_convert_encoding. That means writing sheet with ->setCellValue.

I did several tests with caching (see table below, showing times), but I didn't get any improvement.

 |                     | Prepare  Create     Create      Writer  Total     Total |
 |                     | data     1st sheet  2nd sheet   ->save            (mins)|
 | No cache            | 48       38         113         128     326       5:26  |
 |                     |                                                         |
 | cache_in_memory     | 52       37         112         129     330       5:30  | 
 |                     |                                                         |
 | cache_in_memory     | 48       41         121         172     382       6:22  |
 |    _serialized      |                                                         | 
 |                     |                                                         |
 | cache_to_phpTemp    | 49       45         137         215     446       7:26  |  
 | $sheet->fromArray() | 48       17          64          83     212       3:32  |
The last line shows significant improvement when I create whole sheet from array(no caching). The downside is that I push latin-1 data in utf-8 sheet so all accented cells display null instead. I see that part of the improvement time is because of not calling these mb_* functions. But why there is also 30% improvement in Writer->save?

My main question is: Can I make the $sheet->fromArray() accept data in encoding iso-8859-1? That would greatly cut down the spreadsheet generation time.

  • I didn't measure memory use although the memory consumption is also an issue for me.
  • I didn't measure other caching methods, because they typically focus on lowering memory consumption as a trade-off for longer processing time.
  • I am not sure memoryCacheSize applies to other caching types than cache_to_phpTemp. I tried to set it and I got the same result as if I didn't.
  • All times are in seconds with the exception of the last column, which is in minutes.
Nov 29, 2013 at 7:37 PM
"No cache" and "cache in memory" are the same thing, ie there is always caching, and the default is cache_in_memory; which explains why these two sets of figures are almost identical. But using caching will not improve speed; it's designed to reduce memory usage, but has a cost in speed.

If you know that your data is always iso-8859-1, then why do you need mb_detect_encoding()?

PHPExcel always expects data in UTF-8. If it were to be modified to accept any encoding, then it would still need to convert it to UTF-8 internally, so you'd just be transferring that work from your own script to PHPExcel; potentially with the added overhead of needing to determine whether a conversion was needed.

If you have an array of iso-8859-1 data and want to use the fromArray() method, then you might get some performance improvement by handling the conversion to UTF-8 within an array_walk() or array_walk_recursive() call rather than in a manual loop