Checking strings for "bad" characters

Wed Aug 28 04:41:31 EDT 2002

Peter Hansen wrote:

> Harvey Thomas wrote:
> > 
> > I've got some very long Unicode strings which I wish to test for the
> > presence of ASCII characters 0-8 and 14-31. My first thought was to use
> > regular expressions, e.g.:
> > 
> > import re
> > # Character class covering code points 0-8 and 14-31
> > r = re.compile(u'[%s%s]' % (u''.join([unichr(x) for x in range(0, 9)]),
> >                             u''.join([unichr(x) for x in range(14, 32)])))
> > amatch = r.search(s)  # s is the Unicode string under test
> > if amatch:
> >     print "Bad characters"
> > else:
> >     print "OK"
> > 
> > but is there a better or faster method?
> 
> If you could use string.maketrans and .translate() to convert all bad
> characters that might be present into a single code (e.g. \x00), and then
> do a simple .find() for that character, you might get the benefits of
> simplicity and extreme speed.
> 
> -Peter
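
The translate-then-find idea quoted above could be sketched roughly as follows. This is an illustration only, written in Python 3 syntax rather than the Python 2 of the thread. string.maketrans builds tables for byte strings, so the sketch swaps in a dict mapping, which is what translate() accepts for Unicode text. The names BAD_CODES, MARKER, BAD_TO_MARKER and has_bad_chars_translate are hypothetical and do not come from the original messages.

```python
# Illustrative sketch, not code from the thread; all names are hypothetical.
BAD_CODES = list(range(0, 9)) + list(range(14, 32))
MARKER = "\x00"

# For Unicode text, translate() takes a dict keyed by code point.
# Every unwanted code point is mapped to the single marker character.
BAD_TO_MARKER = {code: MARKER for code in BAD_CODES}

def has_bad_chars_translate(text):
    """Return True if text contains any code point listed in BAD_CODES."""
    # translate() builds a complete new copy of text; find() then scans it.
    return text.translate(BAD_TO_MARKER).find(MARKER) != -1
```

Note that translate() always produces a full copy of the input before find() can begin scanning it.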

Thanks for the suggestion, Peter, but it's much slower than the RE. I guess that's because it creates a new string, which then has to be scanned. I suppose the only thing faster than the RE would be a C extension, but I don't think that's worth the effort.

Harvey
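
For reference, the regular expression approach that the thread settles on can be written more compactly with range notation inside the character class. The following is a minimal sketch in Python 3 syntax, not the code from the original message. The names BAD_CHARS and has_bad_chars are hypothetical.

```python
import re

# Illustrative sketch; BAD_CHARS and has_bad_chars are hypothetical names.
# The class matches code points 0-8 and 14-31, the ranges named in the thread.
# It is compiled once at import time so repeated checks reuse the pattern.
BAD_CHARS = re.compile(r"[\x00-\x08\x0e-\x1f]")

def has_bad_chars(text):
    """Return True if text contains any character matched by BAD_CHARS."""
    # search() stops at the first match and does not copy the input string.
    return BAD_CHARS.search(text) is not None
```

Relative speed depends on the data and the interpreter version, so the standard timeit module is one way to compare this against alternatives on representative strings.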
