Stripping ASCII codes when parsing
David Pratt
fairwinds at eastlink.ca
Mon Oct 17 13:30:42 EDT 2005
Hi Steve. My plan is to parse the data, removing the control characters
and validating the data as records are added to a dictionary. I will
convert to Unicode after this step but before the data goes into storage
(in which case I think the translate method could work well).
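
For reference, a minimal sketch of how the translate approach might look once the text is already Unicode. It assumes Python 3, which postdates this thread, and the names DELETE_TABLE and strip_controls are hypothetical. On a Unicode string, translate() takes a mapping from code points to replacements, and mapping a code point to None deletes it.

```python
# Python 3 sketch; DELETE_TABLE and strip_controls are hypothetical names.
# Map code points 0-31 and "|" to None so that str.translate() deletes them.
DELETE_TABLE = {i: None for i in range(32)}
DELETE_TABLE[ord("|")] = None


def strip_controls(text):
    """Return text with ASCII control characters and '|' removed."""
    return text.translate(DELETE_TABLE)


# Non-ASCII characters such as U+00E9 pass through untouched.
cleaned = strip_controls("caf\u00e9\t|menu\x07\n")
```

Note that this deletes tab and newline along with the other control characters, so whitespace that should survive has to be handled before this step.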
The encoding itself is not stated explicitly for this data, except to say
that it is ASCII and that, besides not using chars 0-30, ASCII 128-254 is
permitted. I am not certain whether I should assume cp1252 or
ISO-8859-1. I can't say that everyone is using Windows, although the
vast majority likely are.
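
To make the cp1252 versus ISO-8859-1 question concrete, here is a small sketch (Python 3 assumed; the sample bytes are made up for illustration). The two encodings agree on bytes 160-255 and differ on 128-159: ISO-8859-1 maps that range to the C1 control characters, while cp1252 maps most of it to printable punctuation and leaves a few bytes undefined.

```python
# Python 3 sketch; the sample bytes are invented for illustration.
raw = bytes([0x80, 0x93, 0xE9])

as_latin1 = raw.decode("iso-8859-1")  # 0x80 and 0x93 become C1 controls
as_cp1252 = raw.decode("cp1252")      # 0x80 is the euro sign, 0x93 a curly quote

# Byte 0xE9 decodes to the same character (e-acute) under both encodings.
same_upper_range = as_latin1[2] == as_cp1252[2]

# cp1252 leaves a handful of bytes undefined (0x81 among them),
# so decoding can fail where ISO-8859-1 never does.
try:
    bytes([0x81]).decode("cp1252")
    cp1252_rejects_0x81 = False
except UnicodeDecodeError:
    cp1252_rejects_0x81 = True
```

One consequence for the stripping step: if the data is decoded as ISO-8859-1, any bytes in 128-159 turn into control characters that a check of ord(c) < 32 would not catch.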
Do you think it is safe to convert to Unicode before the stage that seeks
out control characters and validates, or should I do it after? My
validations are relatively simple: ensure that if I am expecting a date,
integer, string, etc., the data is what it is supposed to be (since the
next stage is the database), unify whitespace, remove control characters,
and check for SQL strings in the data to prevent anything stupid from
happening if someone wanted to be malicious.
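
A rough sketch of the kind of validation described above, under several assumptions: Python 3, a three-field record (name, date, integer), dates in YYYY-MM-DD form, and sqlite3 standing in for whatever database is actually used. The function and table names are hypothetical. It passes values as query parameters rather than scanning the data for SQL strings, which is a different technique from the check mentioned above: the driver treats the values purely as data.

```python
# Python 3 sketch; record layout, date format, table and function names
# are all assumptions made for illustration.
import sqlite3
from datetime import datetime


def unify_whitespace(text):
    """Collapse every run of whitespace to a single space and trim the ends."""
    return " ".join(text.split())


def validate_record(fields):
    """Return (name, iso_date, count); raise ValueError if a field is malformed."""
    name, joined, count = fields
    joined_date = datetime.strptime(joined.strip(), "%Y-%m-%d").date()
    return (unify_whitespace(name), joined_date.isoformat(), int(count))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (name TEXT, joined TEXT, visits INTEGER)")

record = validate_record(["Robert'); DROP TABLE members;--", "2005-10-17", " 42 "])
# The ? placeholders keep the values out of the SQL text entirely.
conn.execute("INSERT INTO members VALUES (?, ?, ?)", record)
rows = conn.execute("SELECT name, joined, visits FROM members").fetchall()
```

A record with a malformed date or a non-numeric count raises ValueError in validate_record, so it can be rejected before it reaches the dictionary or the database.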
Regards,
David
On Monday, October 17, 2005, at 12:49 PM, Steve Holden wrote:
> David Pratt wrote:
> [about ord(), chr() and stripping control characters]
>> Many thanks Steve. This is good information. I think this should work
>> fine. I was doing a string.replace in a cleanData() method with the
>> following characters but don't know if that would have done it. This
>> contains all the control characters that I really know about in normal
>> use. ord(c) < 32 sounds like a much better way to go and
>> comprehensive.
>> So I guess instead of string.replace, I should do a ... for char
>> in ... and evaluate each character, correct? - or is there a
>> better way of eliminating these other than reading a string in
>> character by character.
>>
>> '\a','\b','\e','\f','\n','\r','\t','\v','|'
>>
>
> There are a number of different things you might want to try. One is
> translate() which, given a string and a translate table, will perform
> the translation all in one go. For example:
>
> >>> delchars = "".join(chr(i) for i in range(32)) + "|"
> >>> print repr(delchars)
> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f|'
> >>> nultxfrm = "".join(chr(i) for i in range(256))
> >>>
>
> So delchars is a string of the characters you want to remove, and
> nultxfrm is a 256-character string where nultxfrm[n] == chr(n) - this
> performs no translation at all. So then
>
> s = s.translate(nultxfrm, delchars)
>
> will remove all the "illegal" characters from s.
>
> Note that I am sort-of cheating here, as this is only going to work for
> 8-bit characters. Once Unicode enters the picture all bets are off.
>
> regards
> Steve
> --
> Steve Holden +44 150 684 7255 +1 800 494 3119
> Holden Web LLC www.holdenweb.com
> PyCon TX 2006 www.python.org/pycon/
>