Checking strings for "bad" characters

jepler at unpythonic.net
Tue Aug 27 10:36:52 EDT 2002


This should be reasonably efficient.  For instance, it should beat the
approach of
	for c in range(0, 9) + range(14, 32):
		if unichr(c) in long_unicode_string:
			...
since that approach scans the string once for every character in the "bad"
set.
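For reference, here is a rough sketch (untested) of the single-pass idea,
assuming long_unicode_string from the snippet above: build one character
class out of the disallowed code points and let the RE engine scan the
string once.
	import re
	# character class matching any of the disallowed control characters
	bad = re.compile(u'[%s]' % u''.join([unichr(c) for c in range(0, 9) + range(14, 32)]))
	if bad.search(long_unicode_string):
		print "string contains a bad character"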

If matching on unicode strings turns out to be much slower than matching on
8-bit strings, maybe a sequence like
	long_utf8_string = long_unicode_string.encode("utf-8")
	r.search(long_utf8_string)
would be worth trying; that is, does the possibly faster search make up for
the cost of the conversion?

My testing suggests that it does not.  See the program below.
unicode 3.69167995453
unicode re 3.7019649744
utf-8 6.15347003937

(The "unicode" vs. "unicode re" timings show that it makes no difference
whether the RE pattern is given as a unicode or an 8-bit string.)

Jeff

import time, re

# a million copies of a character that is not in the "bad" set
us = u"\u0100" * 1000 * 1000
# the same character class, compiled from an 8-bit pattern and from a unicode pattern
r = re.compile('[%s]' % ''.join([chr(x) for x in range(0, 9) + range(14, 32)]))
ru = re.compile(u'[%s]' % ''.join([chr(x) for x in range(0, 9) + range(14, 32)]))

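# three searches of the unicode string, pattern compiled from an 8-bit string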
t = time.time()
for i in range(3):
	r.search(us)
print "unicode", time.time() - t

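# the same searches, pattern compiled from a unicode string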
t = time.time()
for i in range(3):
	ru.search(us)
print "unicode re", time.time() - t

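# encode to utf-8 inside the timed loop, then search the 8-bit result;
# the timing therefore includes the cost of the conversion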
t = time.time()
for i in range(3):
	s = us.encode("utf-8")
	r.search(s)
print "utf-8", time.time() - t



