Best way to sanitize incoming data?

Topics: User Forum
Jul 12, 2011 at 12:02 PM

Hi

Is there a best way to sanitize all incoming spreadsheet data?

I'm concerned about security, etc.

Thanks

Feb 2, 2012 at 3:55 PM

Any thoughts on this?

I'd like to run the incoming file and data through a number of filters before posting to MySQL, and was wondering if anyone can recommend a good approach.

Feb 10, 2012 at 6:29 PM

I just use this for the data...

$str = preg_replace('/
    (                                        # group 1: a run of valid UTF-8
      (?: [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      ){1,100}                               # bounded to keep backtracking small
    )
  | .                                        # any other byte is invalid: drop it
/xs', '$1', $str); // byte ranges from http://www.w3.org/International/questions/qa-forms-utf-8

$str = addslashes($str);
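
A note on the addslashes() step: it does not know the database connection's character set, so for MySQL a prepared statement is generally the safer route. A minimal sketch, assuming PDO is available. The helper name insertCellValue() and the table/column names "cells" and "value" are made up for illustration and do not come from this thread:

```php
<?php
// Hypothetical helper: stores one cell value using a bound parameter,
// so the value is never spliced into the SQL text.
function insertCellValue(PDO $pdo, string $value): void
{
    // "cells" and "value" are placeholder names, not from the thread.
    $stmt = $pdo->prepare('INSERT INTO cells (value) VALUES (:value)');
    $stmt->execute([':value' => $value]);
}
```

The PDO connection itself (DSN, user, password, charset) depends on your setup and is left out here.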

Feb 13, 2012 at 1:27 PM

Hello, I'm concerned about security too.

Scott, do you apply this code after the getCell() and getValue() calls?

I am just doing something like strip_tags().

Isn't that regular expression a little heavy on memory resources (if you apply it to every cell value)?

Feb 13, 2012 at 1:38 PM

I do it after getValue(). I'm not sure about memory, but our data goes to a system that will choke on any invalid UTF-8 in strings.
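
On the cost question, one possible approach is to run a cheap validity check first and only fall back to the big pattern for cells that fail it. This is a sketch, not something from the posts above: the function name stripInvalidUtf8() is hypothetical, and the fast path relies on the PCRE /u modifier rejecting subjects that are not valid UTF-8.

```php
<?php
// Hypothetical helper: returns $str with any invalid UTF-8 bytes dropped.
function stripInvalidUtf8(string $str): string
{
    // Fast path: an empty pattern with /u only matches when the whole
    // subject is valid UTF-8. Note it also accepts control characters
    // that the slow path below would strip.
    if (preg_match('//u', $str) === 1) {
        return $str;
    }
    // Slow path: keep runs of valid sequences (group 1), drop any other byte.
    $clean = preg_replace('/
        ( (?: [\x09\x0A\x0D\x20-\x7E]
            | [\xC2-\xDF][\x80-\xBF]
            |  \xE0[\xA0-\xBF][\x80-\xBF]
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
            |  \xED[\x80-\x9F][\x80-\xBF]
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}
            | [\xF1-\xF3][\x80-\xBF]{3}
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}
          ){1,100} )
        | .
    /xs', '$1', $str);
    // preg_replace returns null on a PCRE error; treat that as "nothing usable".
    return $clean === null ? '' : $clean;
}
```

You would call it on each value returned by getValue() before passing the value on.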

Feb 13, 2012 at 1:58 PM

Sure, it's useful if you are importing into a DB. Thanks for sharing that.

Apr 23, 2012 at 1:25 PM
Edited Apr 23, 2012 at 1:26 PM

I'm worried about memory too. Some users might upload files with, say, 5000 rows of data, and each cell needs to be checked.

Would there be a way to extract the raw data first using PHP or Perl and then submit it to some sort of filter, perhaps?
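
One way to keep memory flat while filtering is to handle one row at a time instead of building a cleaned copy of the whole sheet. The following is only a sketch under that assumption: filterRows() is a hypothetical helper, and the rows are a plain array here, where in practice they would come from whatever reads the spreadsheet.

```php
<?php
// Hypothetical helper: applies $filter to every cell, one row at a time.
// $rows can be any iterable (an array, or a reader that yields rows lazily).
function filterRows(iterable $rows, callable $filter): Generator
{
    foreach ($rows as $row) {
        yield array_map($filter, $row);
    }
}

// Example with trim() standing in for a real sanitizing function.
$rows = [['  a ', "b\t"], ['c', ' d']];
foreach (filterRows($rows, 'trim') as $clean) {
    echo implode('|', $clean), "\n";
}
```

Because the function is a generator, only the row currently being processed is held in its cleaned form.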

Thanks

Nov 13, 2012 at 2:07 PM
Edited Nov 13, 2012 at 2:09 PM

Regarding the line:

[\x09\x0A\x0D\x20-\x7E]     # ASCII

Do you mean that all ASCII characters would be removed from incoming data?

What I'm trying to do is (1) allow only a few specified ASCII characters and symbols, (2) exclude certain words or phrases like the word "hate", and (3) sanitize any command-type stuff.

Would this be possible?

Thanks

Nov 15, 2012 at 2:40 PM
Edited Nov 15, 2012 at 2:42 PM

Sounds like you are trying to filter the cell data characters against a whitelist, and then filter the cell data words against a blacklist.

1) You can use preg_replace (PHP function) to remove any characters that are not in your whitelist of 'valid' characters.

2) You can use str_replace (PHP function) to remove any words/phrases from the data. Your blacklist would be an array of the words/phrases you wish to remove. Make sure the list is ordered with the longest phrases first and the single words last, so a short word is not stripped out of a longer phrase before the phrase itself can match.

3) See number 2 above.

- Christopher Mullins
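
The two steps above could be sketched roughly like this. The function names and the allowed character set are only examples, and str_ireplace is used instead of str_replace so the blacklist is case-insensitive:

```php
<?php
// Whitelist: keep only letters, digits, space and a few punctuation marks.
// The allowed set here is an example, not a recommendation.
function keepAllowedChars(string $value): string
{
    return preg_replace('/[^A-Za-z0-9 .,\-]/', '', $value);
}

// Blacklist: remove listed words/phrases, longest first, case-insensitively.
function removeBlacklisted(string $value, array $blacklist): string
{
    usort($blacklist, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    return str_ireplace($blacklist, '', $value);
}

$value = keepAllowedChars('I hate this!! <DROP TABLE> x');
echo removeBlacklisted($value, ['hate', 'drop table']), "\n";
```

Keep in mind that plain substring removal also hits the middle of longer words (for example "hate" inside "whatever"), and that removing words is not a substitute for escaping or bound parameters when the data goes into SQL.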

Nov 16, 2012 at 3:22 PM
Edited Nov 16, 2012 at 4:18 PM

Yes, thanks :)

1. Should I remove the ASCII part of the original preg_replace above?

2. Also, what does the first preg_replace above do? (I'm not yet familiar with hex things.)

$str = preg_replace('/
    (                                        # group 1: a run of valid UTF-8
      (?: [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      ){1,100}                               # bounded to keep backtracking small
    )
  | .                                        # any other byte is invalid: drop it
/xs', '$1', $str); // byte ranges from http://www.w3.org/International/questions/qa-forms-utf-8

$str = addslashes($str);

3. So I guess I'd be doing three filters: (1) filter using the first preg_replace provided above, (2) filter for allowed ASCII characters, and (3) filter for disallowed words or phrases.

Sound about right?

Thanks

Nov 17, 2012 at 12:22 AM
Edited Nov 17, 2012 at 12:25 AM

1) This seems to be an encoding thing which I'm not sure about. Someone else might be able to give you an answer on that.

2) preg_replace('<pattern to search for>', '<string to replace matches with>', <string variable to search in>)

The first argument is a Perl-compatible regular expression (PCRE), which here is a list of patterns joined with the alternation (or) symbol |, so...

[\x09\x0A\x0D\x20-\x7E] - Find one ASCII byte: tab (9), line feed (10), carriage return (13), or a printable character (32 - 126), or

[\xC2-\xDF][\x80-\xBF] - Find a byte in 194 - 223 followed by a byte in 128 - 191, or

\xE0[\xA0-\xBF][\x80-\xBF] - Find byte 224, then a byte in 160 - 191, then a byte in 128 - 191, or

[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} - Find a byte in 225 - 236, 238 or 239, followed by exactly 2 bytes in 128 - 191, or

\xED[\x80-\x9F][\x80-\xBF] - Find byte 237, then a byte in 128 - 159, then a byte in 128 - 191, or

\xF0[\x90-\xBF][\x80-\xBF]{2} - Find byte 240, then a byte in 144 - 191, then exactly 2 bytes in 128 - 191, or

[\xF1-\xF3][\x80-\xBF]{3} - Find a byte in 241 - 243 followed by exactly 3 bytes in 128 - 191, or

\xF4[\x80-\x8F][\x80-\xBF]{2} - Find byte 244, then a byte in 128 - 143, then exactly 2 bytes in 128 - 191

The second argument is the replacement string. When it is an empty string, anything matching the pattern is removed from the string.
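
For the hex notation in the patterns above, it may help to look at the actual byte values of a string. A small sketch (byteValues() is just an illustrative helper name):

```php
<?php
// Returns the decimal value of each byte in $s.
function byteValues(string $s): array
{
    return array_map('ord', str_split($s));
}

// "A" is one ASCII byte; "é" (U+00E9) is the two-byte sequence C3 A9.
echo implode(' ', byteValues('A')), "\n";          // 65
echo implode(' ', byteValues("\xC3\xA9")), "\n";   // 195 169
echo bin2hex("\xC3\xA9"), "\n";                    // c3a9
```

Here 195 falls in 194 - 223 and 169 falls in 128 - 191, so "é" matches the second alternative, [\xC2-\xDF][\x80-\xBF].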

3) Since you have an example using the OR symbol (|), you could create a couple of preg_replace commands to take care of everything if your blacklist rarely changes.

- Christopher Mullins
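
As a rough illustration of point 3, a word blacklist can be turned into a single alternation pattern. This is a sketch with a hypothetical helper name, buildBlacklistPattern(), and it assumes whole-word matching is what you want:

```php
<?php
// Hypothetical helper: builds one case-insensitive pattern from a word list.
function buildBlacklistPattern(array $words): string
{
    // Longest first, so a phrase wins over a shorter word it contains.
    usort($words, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    // Escape regex metacharacters (and the / delimiter) in each entry.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);
    // \b limits matches to whole words, so "whatever" is left alone.
    return '/\b(?:' . implode('|', $quoted) . ')\b/i';
}

$pattern = buildBlacklistPattern(['hate', 'drop table']);
echo preg_replace($pattern, '', 'Whatever you hate, DROP TABLE x'), "\n";
```

Building the pattern once and reusing it for every cell avoids re-sorting and re-quoting the list on each call.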