Strange behavior while reading in chunks - please help

Topics: Developer Forum, User Forum
Nov 25, 2011 at 7:19 AM

Hi folks,

I have a class with a function import() which worked correct while NOT reading a file in Excel5 format in chunks.
Now I changed it to read the file in chunks and it behaves in a strange way. Maybe I am missing something or made a mistake somewhere...

This is what happens:

Assume I set the chunk size to 10 for testing purposes.

The first loop through the read chunk loop correctly returns the cells of the first 10 rows.

The second loop through the read chunk loop returns 10 rows of empty cells (but correct amount of cells) PLUS the cells of the next 10 rows!

All loops after that again return the cells of 10 rows like expected.

To make it more clear:

1st loop:
 - 10 rows with correct data of the cells
2nd loop:
 - 10 rows with returning empty cell data
 - 10 rows with correct data of the cells
All further loops:
 - 10 rows with correct data of the cells

While all that, $objWorksheetRow->getRowIndex() always returns the correct index number, which means, in the first loop 1 to 10, but in the second loop 1 to 20, since the 3rd loop again 1 to 10.

It seems I missed something or made a mistake. - I already stripped down my code as far as possible to run debug-tests but can't find the what's wrong. Maybe I'm blind =-(

Could someone please take a look at my stripped down example code? - I searched for the reason for two days now and am somewhat desperate.

Many thanks in advance!

Here's some stripped down code:

class chunkReadFilter implements PHPExcel_Reader_IReadFilter {
	private $_startRow = 0;
	private $_endRow = 0;
	private $_columnLetters = array();
	private $_worksheet_name = '';

	/**  Set the list of rows that we want to read  */
	public function setRows($startRow, $chunkSize) {
		$this->_startRow	= $startRow;
		$this->_endRow		= $startRow + $chunkSize;
	}

	public function setColumns($columnLetters) {
		$this->_columnLetters	= $columnLetters;
	}

	public function setWorksheetName($worksheet_name) {
		$this->_worksheet_name	= $worksheet_name;
	}

	public function readCell($column, $row, $worksheetName = '') {
		if (count($this->_columnLetters) > 0) {
			if ( $row >= $this->_startRow && $row < $this->_endRow && $this->_worksheet_name == $worksheetName && array_search($column, $this->_columnLetters) !== FALSE) {
				return TRUE;
			}
		}
		else {
			if ( $row >= $this->_startRow && $row < $this->_endRow && $this->_worksheet_name == $worksheetName ) {
				return TRUE;
			}
		}

		// Just for debugging:
		if ( ! defined('SKIPPING_STARTTIME') ) {
			define('SKIPPING_STARTTIME', microtime(TRUE));
		}

		return FALSE;
	}
}

 

The next part of code is inside a class in my import function:

class ...

function ...

$this->objReader = PHPExcel_IOFactory::createReader($this->import_session['inputFileType']);

$this->objReader->setLoadSheetsOnly($this->import_session['worksheetName']);

$this->objReader->setReadDataOnly(TRUE);

$this->chunkFilter = new chunkReadFilter();

$this->objReader->setReadFilter($this->chunkFilter);

$first_row = 1;

$this->chunkFilter->setWorksheetName($this->import_session['worksheetName']);

for ($StartRow = $first_row; $StartRow <= $this->import_session['HighestRow']; $StartRow += $this->chunkSize) {
	$this->chunkFilter->setRows($StartRow, $this->chunkSize);

	try	{
		$this->objPHPExcel = $this->objReader->load($inputFileName);
	}
	catch(Exception $e) {
		$GLOBALS['objMessagestack']->add_message('%s, line %s: %s!', 'error', FALSE, array(basename(__FILE__), __LINE__, $e->getMessage()));
	}

	$this->objPHPExcelWorksheet = $this->objPHPExcel->getSheet($this->import_session['selected_worksheet_id']);

	foreach ($this->objPHPExcelWorksheet->getRowIterator() AS $objWorksheetRow) {
		$chunk_row_idx = $objWorksheetRow->getRowIndex();

		$cellIterator = $objWorksheetRow->getCellIterator();

		$cellIterator->setIterateOnlyExistingCells(FALSE); // Loop all cells, even if it is not set

		foreach ($cellIterator as $cell) {
			// Collect the data of the row:
			$cell->setDataType(PHPExcel_Cell_DataType::TYPE_STRING);
			$cellvalue = $cell->getCalculatedValue();

			... process $cellvalue ...
		}

	} // End getRowIterator
} // End Chunk reading

Nov 25, 2011 at 10:36 AM

Sorry, I made a mistake describing the strange behavior, I interrupted the script too early to notice that.

Not all loops since the second loop are OK:

It behaves like this:

loop 1:
 - 10 rows with correct data of the cells

since 2nd loop:
 - (loop-nr - 1) * 10 rows with returning empty cell data
 - 10 rows with correct data of the cells

Again: Thanks in advance for every hint.

Yours Henri

Coordinator
Nov 28, 2011 at 1:11 PM

This is expected behaviour... PHPExcel still reads cell data to the correct cell addresses, it simply reads a defined block of cell addresses rather than reading the entire workbook in one go. So if you tell it to read cells A10 to E20, it will read these cells as A10 to E20, not as A1 to E11.

Nov 28, 2011 at 6:38 PM

First of all: Many thanks for your reply Mark.

But.. Sorry, I don't understand your answer:

What do you mean with: A10 to E20? - All cells from column A row 10 to column E row 20 ?

And how is that related to my problem?

Am I wrong with the following understanding?

Let's say I have 100 rows and columns from A - Z

-  Shouldn't $this->objPHPExcelWorksheet->getRowIterator() give me all $objWorksheetRow s that were not filtered out by the chunkFilter when objReader->load(...)?
-  Which means when I have a chunkSize of 10 it should give me ten $objWorksheetRow s for the rows from $StartRow to ($StartRow + $this->chunkSize - 1)
-  Shouldn' it then always count when $chunk_row_idx = $objWorksheetRow->getRowIndex(); from 1 to 10?
-  Shouldn't the cellIterator give me all cells of one row and not more?

So, is the method to use a row-iterator and a cell-iterator not the best solution when reading a worksheet in chunks?

Should I better manually count rows and cols and retrieve the cell's data with $this->objPHPExcelWorksheet->getCellByColumnAndRow($col, $row)->getCalculatedValue() ?

Many thanks in advance,
    yours Henri

Coordinator
Nov 28, 2011 at 7:43 PM

The chunk filter is applied to the reader, so in the second iteration (with start 10, size 10) it reads only cells in rows 10 to 19 from the workbook file. Rows 1-9 are empty rows, containing no cells, because they are not read from the workbook file. The filter does not, however, affect anything other than the reader, so it has no effect on the rowIterator at all. The rowIterator always starts with row 1, so it will still iterate over the initial empty rows before reaching row 10 and the other rows that have actually been read from the workbook file: effectively, the behaviour is setIterateOnlyExistingRows(FALSE); - although there is no option to change this.

I'll look to provide a method to set the start row for a rowiterator, both in the constructor and as a manually callable method.

Nov 28, 2011 at 10:47 PM
Edited Nov 28, 2011 at 10:50 PM

Ahaaaa... :-) Many thanks for that answer Mark, now it's clear to me.

 

You also wrote. "I'll look to provide a method to set the start row for a rowiterator, both in the constructor and as a manually callable method."...

That would be fine for future codings with PHPExcel, but if it would be possible, it would even be better (only for the reading of a row-range) to provide a method like...

$objReader->load($inputFileName, $first_row_to_load, $last_row_to_load);

...which would really only iterate to load from $first_row_to_load until $last_row_to_load without the need of the filter.

If it's not possible to really directly set a pointer to let's say row $first_row_to_load to start loading, it would at least be very useful to be able to "break" loading after loading of $last_row_to_load.

The reason is, that if I have a worksheet with 60000 rows with about 20 columns, read in chunks of 1000 rows (and such big files have to be read in chunks with PHPExcel), even filtering out the other 59000 rows once needs about 3 minutes depending on the processor-power of the computer. And this has to be done 60 times!
This means that ONLY the skipping of the rows when read in chunks of 1000 rows needs one to several hours for the whole file when using a chunkFilter! - Really loading the not skipped cells needs additional time.

 

I also make extensive use of the new method listWorksheetInfo described in http://phpexcel.codeplex.com/discussions/259971 and expecially of my little extension for it to return also an array of all column-letters and my method compareColumnLetters(...) which are very, very useful.
Maybe you like to take a look at the thread and think about adding it also to PHPExcel.

Again, many thanks for your help and great effort on this project,
    yours Henri