Stripping ASCII codes when parsing

Mon Oct 17 13:30:42 EDT 2005

Hi Steve.  My plan is to parse the data, removing the control characters
and validating the data as records are added to a dictionary. I will
convert to Unicode after this step but before the data goes into storage
(in which case I think the translate method could work well).
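
For the Unicode side, a minimal sketch of what translate() could look
like, assuming Python 3 (where str is already Unicode). The names
CONTROL_MAP and strip_controls are made up for illustration:

```python
# Map every code point below 32, plus the "|" delimiter, to None.
# str.translate() drops any character whose table entry is None.
CONTROL_MAP = {i: None for i in range(32)}
CONTROL_MAP[ord("|")] = None


def strip_controls(text):
    """Return text with ASCII control characters and '|' removed."""
    return text.translate(CONTROL_MAP)


print(repr(strip_controls("Pratt|\tDavid\r\n")))
```

Characters outside the table, accented letters included, pass through
untouched.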

The encoding is not stated explicitly for this data, except to say that
it is ASCII and that, besides not using chars 0-30, ASCII 128-254 is
permitted. I am not certain whether I should assume cp1252 or
ISO-8859-1. I can't say that everyone is using Windows, although the
vast majority likely are.
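
To make the difference between the two candidates concrete, here is a
small sketch, assuming Python 3 and a made-up sample value. The helper
decode_field and its fallback policy are assumptions for illustration,
not a recommendation from this thread:

```python
# Sample bytes: Windows "smart quotes" (0x93, 0x94) and an e-acute (0xE9).
raw = b"\x93quoted\x94 caf\xe9"

# ISO-8859-1 maps every byte to the code point of the same number,
# so 0x93 and 0x94 come out as C1 control characters.
as_latin1 = raw.decode("iso-8859-1")

# cp1252 maps 0x93 and 0x94 to curly quotes (U+201C, U+201D).
as_cp1252 = raw.decode("cp1252")


def decode_field(data):
    """Try cp1252 first; fall back to ISO-8859-1, which never fails."""
    try:
        return data.decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few bytes (e.g. 0x81) unassigned.
        return data.decode("iso-8859-1")


print(repr(as_latin1))
print(repr(as_cp1252))
```

The two encodings agree on bytes 0xA0-0xFF and differ only in the range
0x80-0x9F.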

Would you think it safe to convert to Unicode before or after the stage
that seeks out control characters and validates? My validations are
relatively simple: ensure that if I am expecting a date, integer,
string etc. the data is what it is supposed to be (since the next stage
is the database), unify whitespace, remove control characters, and
check for SQL strings in the data to prevent any stupid things from
happening if someone wanted to be malicious.
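
A rough sketch of the kind of validation I mean, assuming Python 3. The
field names ("name", "count", "joined") and the date format are
hypothetical, chosen only for the example:

```python
import re
from datetime import datetime


def unify_whitespace(value):
    """Collapse each run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()


def parse_date(value, fmt="%Y-%m-%d"):
    """Return a date, or raise ValueError if value does not match fmt."""
    return datetime.strptime(value, fmt).date()


def validate_record(fields):
    """Clean one record (a dict of str values); raise ValueError if bad."""
    return {
        "name": unify_whitespace(fields["name"]),
        "count": int(unify_whitespace(fields["count"])),
        "joined": parse_date(unify_whitespace(fields["joined"])),
    }


record = validate_record(
    {"name": "  David \t Pratt ", "count": " 42 ", "joined": "2005-10-17"}
)
print(record)
```

For the SQL concern, passing values to the database as bound query
parameters is generally more dependable than scanning the data for SQL
strings; that part is not shown here.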

Regards,
David

On Monday, October 17, 2005, at 12:49 PM, Steve Holden wrote:

> David Pratt wrote:
> [about ord(), chr() and stripping control characters]
>> Many thanks Steve. This is good information. I think this should work
>> fine. I was doing a string.replace in a cleanData() method with the
>> following characters but don't know if that would have done it. This
>> contains all the control characters that I really know about in normal
>> use. ord(c) < 32 sounds like a much better and more comprehensive
>> way to go.
>>   So I guess instead of string.replace, I should do a    ...  for char
>> in ...    and evaluate each character, correct? - or is there a
>> better way of eliminating these other than reading a string in
>> character by character.
>>
>> '\a','\b','\e','\f','\n','\r','\t','\v','|'
>>
>
> There are a number of different things you might want to try. One is
> translate() which, given a string and a translate table, will perform
> the translation all in one go. For example:
>
> >>> delchars = "".join(chr(i) for i in range(32)) + "|"
> >>> print repr(delchars)
> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f|'
> >>> nultxfrm = "".join(chr(i) for i in range(256))
>
> So delchars is a string of the characters you want to remove, and
> nultxfrm is a 256-character string where nultxfrm[n] == chr(n), so it
> performs no translation at all. So then
>
>      s = s.translate(nultxfrm, delchars)
>
> will remove all the "illegal" characters from s.
>
> Note that I am sort-of cheating here, as this is only going to work for
> 8-bit characters. Once Unicode enters the picture all bets are off.
>
> regards
>   Steve
> -- 
> Steve Holden       +44 150 684 7255  +1 800 494 3119
> Holden Web LLC                     www.holdenweb.com
> PyCon TX 2006                  www.python.org/pycon/
>
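
The translate() call quoted above can be sketched for Python 3, where
8-bit data is a bytes object. This is an illustration under that
assumption; DELETE_BYTES and strip_control_bytes are made-up names:

```python
# Bytes 0-31 plus the "|" delimiter, matching delchars in the quoted text.
DELETE_BYTES = bytes(range(32)) + b"|"


def strip_control_bytes(raw):
    """Delete control bytes and '|' from raw, leaving other bytes as-is."""
    # Passing None as the table means no byte is remapped, so a
    # 256-entry identity table like nultxfrm is not needed here.
    return raw.translate(None, DELETE_BYTES)


print(strip_control_bytes(b"abc\x07|def\r\n"))
```

Bytes 128-255 are left alone, so the result can still be decoded to
text afterwards.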