Stripping ASCII codes when parsing

Mon Oct 17 11:49:32 EDT 2005

David Pratt wrote:
[about ord(), chr() and stripping control characters]
> Many thanks Steve. This is good information. I think this should work 
> fine. I was doing a string.replace in a cleanData() method with the 
> following characters but don't know if that would have done it. This 
> contains all the control characters that I really know about in normal 
> use. ord(c) < 32 sounds like a much better way to go and comprehensive. 
>   So I guess instead of string.replace, I should do a    ...  for char 
> in ...    and check evaluate each character, correct? - or is there a 
> better way of eliminating these other that reading a string in 
> character by character.
> 
> '\a','\b','\e','\f','\n','\r','\t','\v','|'
> 

There are a number of different things you might want to try. One is 
translate() which, given a string and a translate table, will perform 
the translation all in one go. For example:

  >>> delchars = "".join(chr(i) for i in range(32)) + "|"
  >>> print repr(delchars)
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f|'
  >>> nultxfrm = "".join(chr(i) for i in range(256))
  >>>

So delchars is a string containing the characters you want to remove,
and nultxfrm is a 256-character string where nultxfrm[n] == chr(n) -
an identity table that performs no translation at all. So then

     s = s.translate(nultxfrm, delchars)

will remove all the "illegal" characters from s.
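
For comparison, the character-by-character approach you asked about can
be written in one line with a generator expression. This is only a
sketch: clean_data and its extra parameter are names I made up for
illustration, not anything from your code.

```python
def clean_data(s, extra="|"):
    """Return s without control characters (ord < 32) or anything in extra."""
    # Keep a character only if it is not a control code and not in extra.
    return "".join(c for c in s if ord(c) >= 32 and c not in extra)


cleaned = clean_data("name\t|value\r\n")
```

It gives the same result as the translate() call for these characters,
but it examines each character in Python code, so I would expect
translate() to be faster on large inputs.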

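Note that I am sort-of cheating here, as the two-argument translate()
call is only going to work for 8-bit strings. Unicode strings have a
translate() method too, but it takes a single mapping argument rather
than a table plus a string of characters to delete, so the call above
won't work on them unchanged.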
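
If you do end up with Unicode text, one possible approach is to pass
translate() a mapping from code points to None, which it treats as
"delete this character". The sketch below is written for Python 3, where
every str is Unicode. The names strip_delchars and strip_all_controls
are hypothetical, and the second function assumes you want to drop
everything in Unicode category "Cc", which is broader than ord(c) < 32
(it also covers 0x7f and 0x80-0x9f).

```python
import unicodedata

# The same characters as before: ordinals 0-31 plus "|".
DELCHARS = "".join(chr(i) for i in range(32)) + "|"

# Map each code point to None; translate() deletes characters mapped to None.
DELETE_MAP = dict.fromkeys(map(ord, DELCHARS))


def strip_delchars(text):
    """Remove exactly the characters listed in DELCHARS."""
    return text.translate(DELETE_MAP)


def strip_all_controls(text, extra="|"):
    """Remove every character in Unicode category Cc, plus any in extra."""
    return "".join(
        c for c in text
        if unicodedata.category(c) != "Cc" and c not in extra
    )
```

Which of the two you want depends on whether the control codes above
0x7f count as "illegal" in your data.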

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC                     www.holdenweb.com
PyCon TX 2006                  www.python.org/pycon/