Checking strings for "bad" characters
jepler at unpythonic.net
Tue Aug 27 10:36:52 EDT 2002
A regular-expression search should be reasonably efficient. For instance, it
should beat the approach of
    for c in range(0, 9) + range(14, 32):
        if unichr(c) in long_unicode_string:
            ...
since that version scans the string once per bad character (up to 27 full
passes), while the regular expression scans it only once.
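In modern Python 3 terms (my sketch, not code from the original post), the two
approaches look roughly like this:

```python
import re

# "Bad" codes 0-8 and 14-31, as in the post.  None of these characters
# is a regex metacharacter, so they can go into a class unescaped.
BAD = [chr(c) for c in list(range(0, 9)) + list(range(14, 32))]
BAD_RE = re.compile('[%s]' % ''.join(BAD))

def has_bad_loop(s):
    # Scans s once per bad character: up to 27 full passes.
    return any(c in s for c in BAD)

def has_bad_re(s):
    # A single left-to-right scan of s.
    return BAD_RE.search(s) is not None
```

The loop is O(k*n) in the worst case for k bad characters and an n-character
string; the character class does the same job in one pass.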
If matching on unicode strings is too much slower than matching on 8-bit
strings, a sequence like
long_utf8_string = long_unicode_string.encode("utf-8")
r.search(long_utf8_string)
might be worthwhile. That is, does the possibly faster search make up for the
possibly slow conversion?
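One thing worth noting (my addition, not from the original post): this trick
only works because the bad characters survive the UTF-8 encoding unchanged,
which holds here since they are all ASCII control codes:

```python
# Every "bad" character is an ASCII control code, and UTF-8 encodes
# ASCII codes as the same single byte.  A byte-level search of the
# UTF-8 data therefore finds exactly the same characters as a search
# of the unicode string.
bad = [chr(c) for c in list(range(0, 9)) + list(range(14, 32))]
for ch in bad:
    assert ch.encode('utf-8') == bytes([ord(ch)])
```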
My testing suggests that this is not so; see the program below.
unicode 3.69167995453
unicode re 3.7019649744
utf-8 6.15347003937
(Comparing the "unicode" and "unicode re" timings shows that it makes no
difference whether the RE pattern is given as a unicode or an 8-bit string.)
Jeff
import time, re

# One million copies of a non-ASCII character: the worst case, since
# no match is ever found.  (Note the u prefix -- without it, "\u0100"
# is just a literal 6-character 8-bit string.)
us = u"\u0100" * 1000 * 1000

# Character class of the "bad" codes 0-8 and 14-31.
chars = ''.join([chr(x) for x in range(0, 9) + range(14, 32)])
r = re.compile('[%s]' % chars)    # pattern given as an 8-bit string
ru = re.compile(u'[%s]' % chars)  # same pattern given as a unicode string

t = time.time()
for i in range(3):
    r.search(us)
print "unicode", time.time() - t

t = time.time()
for i in range(3):
    ru.search(us)
print "unicode re", time.time() - t

t = time.time()
for i in range(3):
    s = us.encode("utf-8")  # conversion is paid on every iteration
    r.search(s)
print "utf-8", time.time() - t
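For readers on current interpreters, a rough Python 3 port of the benchmark
might look like the following (my sketch: the string is smaller here,
time.perf_counter replaces time.time, and a separate bytes pattern is
compiled because Python 3's re module will not search bytes with a str
pattern):

```python
import re
import time

N = 100 * 1000            # smaller than the post's million characters
us = "\u0100" * N         # worst case: no match is ever found

# "Bad" codes 0-8 and 14-31, as in the post.
chars = ''.join(chr(x) for x in list(range(0, 9)) + list(range(14, 32)))
r = re.compile('[%s]' % chars)                     # str pattern
rb = re.compile(('[%s]' % chars).encode('ascii'))  # bytes pattern

t = time.perf_counter()
for _ in range(3):
    r.search(us)
print("str search", time.perf_counter() - t)

t = time.perf_counter()
for _ in range(3):
    s = us.encode('utf-8')  # conversion is paid on every iteration
    rb.search(s)
print("utf-8 search", time.perf_counter() - t)
```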
More information about the Python-list mailing list