Stripping ASCII codes when parsing
Steve Holden
steve at holdenweb.com
Mon Oct 17 11:49:32 EDT 2005
David Pratt wrote:
[about ord(), chr() and stripping control characters]
> Many thanks Steve. This is good information. I think this should work
> fine. I was doing a string.replace in a cleanData() method with the
> following characters but don't know if that would have done it. This
> contains all the control characters that I really know about in normal
> use. ord(c) < 32 sounds like a much better way to go and comprehensive.
> So I guess instead of string.replace, I should do a ... for char
> in ... and evaluate each character, correct? - or is there a
> better way of eliminating these other than reading a string in
> character by character.
>
> '\a','\b','\e','\f','\n','\r','\t','\v','|'
>
There are a number of different things you might want to try. One is
translate() which, given a string and a translate table, will perform
the translation all in one go. For example:
>>> delchars = "".join(chr(i) for i in range(32)) + "|"
>>> print repr(delchars)
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f|'
>>> nultxfrm = "".join(chr(i) for i in range(256))
>>>
So delchars is a string containing the characters you want to remove,
and nultxfrm is a 256-character string where nultxfrm[n] == chr(n), so
it performs no translation at all. So then
s = s.translate(nultxfrm, delchars)
will remove all the "illegal" characters from s.
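
For readers on Python 3, where str.translate() takes a single table
argument, the same idea can be sketched roughly as follows. This is only
an illustration under that assumption: clean_data, DELCHARS and
DELETE_TABLE are hypothetical names, not anything from the original code.

```python
# Python 3 sketch of the same technique; all names here are hypothetical.
DELCHARS = "".join(chr(i) for i in range(32)) + "|"

# With three arguments, str.maketrans maps every character in the third
# argument to None, which translate() treats as "delete this character".
DELETE_TABLE = str.maketrans("", "", DELCHARS)


def clean_data(s):
    """Return s with control characters (ord < 32) and '|' removed."""
    return s.translate(DELETE_TABLE)


print(repr(clean_data("a\tb|c\r\nd\x07e")))  # prints 'abcde'
```

Building the table once and reusing it keeps the per-string work down to
a single translate() call, which is the point of the approach above.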
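Note that I am sort-of cheating here, as this is only going to work for
8-bit strings. Once Unicode enters the picture all bets are off: the
translate() method of unicode objects takes a single mapping argument
rather than a table plus a string of characters to delete.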
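
If Unicode text is involved, one possible approach is the
character-by-character filter David asked about, written as a generator
expression. The sketch below is Python 3 and rests on an assumption of
mine: it uses unicodedata.category() to drop everything in the "Cc"
(control) category, which is broader than ord(c) < 32 because it also
covers DEL and the C1 controls. strip_controls and its extra parameter
are hypothetical names.

```python
import unicodedata


def strip_controls(text, extra="|"):
    """Drop Unicode control characters (category 'Cc') and any in extra.

    Note: 'Cc' covers U+0000-U+001F, U+007F and U+0080-U+009F, so this
    removes more than a plain ord(c) < 32 test would.
    """
    return "".join(
        c for c in text
        if unicodedata.category(c) != "Cc" and c not in extra
    )


print(repr(strip_controls("caf\u00e9\t|\x85 ok\x7f")))
```

Replacing the category test with ord(c) >= 32 gives the narrower
behaviour discussed earlier in the thread.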
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/