Need help tweaking PHPExcel to deal with limited memory and a big xlsx file

Topics: User Forum
Jun 1, 2013 at 11:09 AM
Hello!

My app needs to read a 4-column x 200k-row Excel file and insert it into a database.

The memory limit is 384 MB and the host does not provide the necessary modules for most of the caching methods described in the docs. I believe my only option is cache_to_discISAM.

So far, I've only been able to load a "full" .xls file (65536 rows) without problems, although a bit slowly. But the really big .xlsx file with its ~200k rows will stop after the first chunk loop (something like 2k rows). The error given depends on the chunk size: either an ERR_CONNECTION_RESET or a fatal memory limit error.

Am I doing it right? Can you help me optimize my code?

This is the reading:
// cache cells so PHP doesn't run out of memory
$cache_method = PHPExcel_CachedObjectStorageFactory::cache_to_discISAM;
if (!PHPExcel_Settings::setCacheStorageMethod($cache_method))
    die('caching was not possible');

try {
    // detect filetype
    $filetype = PHPExcel_IOFactory::identify('./uploads/'.$filename);

    // target sheet
    $sheet_name = 'D2D SP';

    // start and end rows by user input
    $start_row = ($this->input->post('start_row') && $this->input->post('start_row') >= 2) ? $this->input->post('start_row') : 2;
    $end_row = ($this->input->post('end_row') && $this->input->post('end_row') > $start_row) ? $this->input->post('end_row') : $start_row + 101;

    // split sheet into chunks so MySQL doesn't run out of memory
    $chunk_size = ($end_row - $start_row + 1 < 2000) ? $end_row - $start_row + 1 : 2000;
    $chunkFilter = new chunkReadFilter();

    // create reader based on detected filetype
    $phpExcelReader = PHPExcel_IOFactory::createReader($filetype);
    $phpExcelReader->setLoadSheetsOnly($sheet_name);
    $phpExcelReader->setReadDataOnly(TRUE);
    $phpExcelReader->setReadFilter($chunkFilter);

    $count_linhas_criadas = 0;
    $count_linhas_modificadas = 0;
    for ($i=$start_row; $i <= $end_row; $i += $chunk_size+1) {

        $chunkFilter->setRows($i, $chunk_size);

        // create THE object
        $phpExcel = $phpExcelReader->load('./uploads/'.$filename);

        // get worksheet dimensions
        $sheet = $phpExcel->getActiveSheet();
        $highest_row = $sheet->getHighestRow();
        // $highest_col = PHPExcel_Cell::columnIndexFromString($sheet->getHighestColumn());
        $highest_col = 'D';

        $values = $sheet->rangeToArray('A'.$i.':D'.($i+$chunk_size>$highest_row?$highest_row:$i+$chunk_size));

        // escape each cell and turn each row into a ('v1','v2','v3','v4') tuple
        $values =   array_map(function($single_row) {
                        $single_row = array_map(function($cell) {
                            $CI = get_instance();
                            return $CI->db->escape_str($cell);
                        }, $single_row);
                        return "('".implode("','",$single_row)."')";
                    }, $values);

        $num_of_rows = count($values);

        $values = implode(',',$values);

        if ($values) {

            $this->db->query("INSERT INTO enderecos_disponiveis (logradouro, numero, cep, bairro) VALUES $values ON DUPLICATE KEY UPDATE data_base = now()");
            // with ON DUPLICATE KEY UPDATE, MySQL counts each updated row as
            // 2 affected rows and each inserted row as 1, so per chunk:
            $chunk_modificadas = $this->db->affected_rows() - $num_of_rows;
            $count_linhas_modificadas += $chunk_modificadas;
            $count_linhas_criadas += $num_of_rows - $chunk_modificadas;

        }

    }

    $this->db->select('id');
    $this->db->from('enderecos');

    $this->data['msg']['count_linhas_modificadas'] = $count_linhas_modificadas;
    $this->data['msg']['count_linhas_criadas'] = $count_linhas_criadas;
    $this->data['msg']['count_linhas_total'] = $this->db->count_all_results();
    $this->data['msg_tpl'] = 'admin/msg/enderecos_atualizar_sucesso';

    return $this->disponiveis();

} catch(Exception $e) {

    $this->data['msg'] = $e->getMessage();
    $this->data['msg_tpl'] = 'admin/msg/enderecos_atualizar_erro_excel';

    return $this->atualizar();

}
And this is the filter:
class chunkReadFilter implements PHPExcel_Reader_IReadFilter {

    private $start_row = 0;
    private $end_row = 0;

    public function setRows($start_row, $chunk_size) {

        $this->start_row = $start_row;
        $this->end_row = $start_row + $chunk_size;

    }

    public function readCell($col, $row, $worksheet_name = '') {

        if (($row == 1) || ($row >= $this->start_row && $row < $this->end_row)) {
            return TRUE;
        }
        return FALSE;

    }

}
Thanks in advance.
Coordinator
Jun 1, 2013 at 11:50 AM
I'd be inclined to iterate through the rows you've read one row at a time rather than using rangeToArray() for the whole chunk that you've read: that duplicates the chunk in memory, adding the overhead of a 2000x4 array. It might be slower, because you're then inserting each row individually into the database, but it should reduce memory usage.
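A minimal sketch of that approach, reusing the $sheet, $i and $chunk_size variables and the CodeIgniter $this->db from your code above (not tested):
foreach ($sheet->getRowIterator() as $row) {
    // skip rows outside the current chunk
    $row_index = $row->getRowIndex();
    if ($row_index < $i || $row_index >= $i + $chunk_size) {
        continue;
    }
    $cells = array();
    $cellIterator = $row->getCellIterator();
    $cellIterator->setIterateOnlyExistingCells(FALSE); // keep blanks so columns line up
    foreach ($cellIterator as $cell) {
        $cells[] = $this->db->escape_str($cell->getValue());
    }
    // one INSERT per row: slower, but only one row is ever held in memory
    $this->db->query("INSERT INTO enderecos_disponiveis (logradouro, numero, cep, bairro) VALUES ('".implode("','", $cells)."') ON DUPLICATE KEY UPDATE data_base = now()");
}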
Jun 1, 2013 at 6:18 PM
Edited Jun 1, 2013 at 6:20 PM
Thanks for the fast reply. I'm gonna try that and report back.

Meanwhile, can you tell if there's another reason in this case why the .xls file runs smoothly while the .xlsx does not (besides, of course, the number of rows)? How do these two compare in terms of reading performance with PHPExcel? I suppose .csv is the winner?
Coordinator
Jun 1, 2013 at 6:29 PM
Edited Jun 1, 2013 at 6:31 PM
There are some basic stats here (though intended primarily to compare the performance of different versions of PHPExcel), but they also show the differences in speed and memory usage between CSV, Excel5 and Excel2007.

The main difference is the completely different structures of the files, and the way in which they can be read/written. At the moment, we use SimpleXML for reading Excel2007 and XMLWriter for writing Excel2007 (version 2.0 will use XMLReader for reading Excel2007 to reduce the memory usage, but at a slight cost in speed), whereas the Excel5 reader and writer work with an OLE2 BIFF binary object, so there are speed overheads there in converting packed data to PHP datatypes, but less memory overhead; and the OLE2 code more easily allows seeking in the data. CSV, of course, has the advantage of not supporting any formatting, so there are no overheads for styles.
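If you want to measure it against your own files, something along these lines gives a rough load time and peak memory figure (hypothetical filename; note that memory_get_peak_usage() never resets within a run, so test one format per invocation):
$file = 'data.xlsx'; // hypothetical: swap in data.xls or data.csv to compare
$start = microtime(true);
$reader = PHPExcel_IOFactory::createReader(PHPExcel_IOFactory::identify($file));
$reader->setReadDataOnly(TRUE);
$workbook = $reader->load($file);
printf('%s: %.2fs, peak %.1f MB', $file, microtime(true) - $start, memory_get_peak_usage(true) / 1048576);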
Jun 1, 2013 at 6:40 PM
Very nice.

About .csv being faster due to lack of formatting support, if I use setReadDataOnly(TRUE) when reading xls and xlsx, does this difference still exist?
Jun 1, 2013 at 7:34 PM
Edited Jun 1, 2013 at 7:56 PM
I've tested iterating one row at a time. It still runs out of memory after the first chunk iteration on the larger (200k rows) xlsx file, but finishes executing OK on the smaller (65k rows) xls file.

What I don't get is why it processes the first chunk and only then runs out of memory. Does the memory usage increase after each chunk iteration? Shouldn't it be replacing the same amount of data in memory for each chunk?

Maybe I should be killing the last chunk data in memory somehow?

EDIT:
It certainly is that. The xls test output proves it's not filtering from where the last chunk stopped; instead, it's loading every row again from the first up to the new limit. For instance: the first iteration prints 1 to 2001 and the second prints 1 to 4002. I'll try moving the filter object creation into the loop and see what it does.

Code (notice I've removed the database writing part for testing):
// detect filetype
$filetype = PHPExcel_IOFactory::identify('./uploads/'.$filename);

// target sheet
$sheet_name = 'D2D SP';

// start and end rows by user input
$start_row = ($this->input->post('start_row') && $this->input->post('start_row') >= 2) ? $this->input->post('start_row') : 2;
$end_row = ($this->input->post('end_row') && $this->input->post('end_row') > $start_row) ? $this->input->post('end_row') : $start_row + 101;

// split sheet into chunks to save memory
$chunk_size = ($end_row - $start_row + 1 < 2000) ? $end_row - $start_row + 1 : 2000;
$chunkFilter = new chunkReadFilter();

// create reader based on detected filetype
$phpExcelReader = PHPExcel_IOFactory::createReader($filetype);
$phpExcelReader->setLoadSheetsOnly($sheet_name);
$phpExcelReader->setReadDataOnly(TRUE);
$phpExcelReader->setReadFilter($chunkFilter);

for ($i=$start_row; $i <= $end_row; $i += $chunk_size+1) {

    $chunkFilter->setRows($i, $chunk_size);

    // create THE object
    $phpExcel = $phpExcelReader->load('./uploads/'.$filename);

    // get worksheet dimensions
    $sheet = $phpExcel->getActiveSheet();

    foreach ($sheet->getRowIterator() as $row) {
        echo $row->getRowIndex().'<br />';
    }

    echo '======= yet another chunk =======';

}
Jun 1, 2013 at 8:43 PM
No luck... no matter what I do, the filter will always start from row 1 again instead of the dynamic $start_row passed to setRows().
Coordinator
Jun 2, 2013 at 10:21 AM
Yes, you should be clearing the last chunk from memory before loading the next, using
$objPHPExcel->disconnectWorksheets();
unset($objPHPExcel);
But for every loop you're iterating through all rows from row 1 to row N, no matter which chunk you've loaded. The RowIterator isn't particularly smart: if you tell it to start from row 1, it will start from row 1 and count up one row at a time, even though those rows have never been loaded. You can change this by telling the RowIterator where to start iterating.
foreach ($sheet->getRowIterator($i) as $row) {
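Putting both changes together, your test loop from the earlier post would look something like this (a sketch using the same variables):
for ($i = $start_row; $i <= $end_row; $i += $chunk_size+1) {

    $chunkFilter->setRows($i, $chunk_size);
    $phpExcel = $phpExcelReader->load('./uploads/'.$filename);
    $sheet = $phpExcel->getActiveSheet();

    // start iterating at the first row of the current chunk, not at row 1
    foreach ($sheet->getRowIterator($i) as $row) {
        echo $row->getRowIndex().'<br />';
    }

    // free the current chunk before loading the next one
    $phpExcel->disconnectWorksheets();
    unset($phpExcel);

}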
Jul 17, 2013 at 2:37 PM
RZB wrote:
No luck... no matter what I do, the filter will always start from row 1 again instead of the dynamic $start_row passed to setRows().
public function readCell($col, $row, $worksheet_name = '') {

    if (($row == 1) || ($row >= $this->start_row && $row < $this->end_row)) {
        return TRUE;
    }
    return FALSE;

}

Remove ($row == 1) from the condition.
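That is, assuming you don't need row 1 (the headings) loaded with every chunk, readCell() reduces to:
public function readCell($col, $row, $worksheet_name = '') {

    // only rows inside the current chunk are loaded
    return ($row >= $this->start_row && $row < $this->end_row);

}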