regexps with unicode-aware characterclasses?
"Martin v. Löwis"
martin at v.loewis.de
Wed Sep 14 02:09:03 EDT 2005
Stefan Rank wrote:
> <wishful thinking>
>
> re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))
This would (almost) work, but it would be terribly inefficient (matching
time is linear in the number of alternatives). A realistic alternative is
to build a single character class:
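import re
import sys

# Collect every uppercase code point into one character class.
uppers = [u'[']
for i in range(sys.maxunicode + 1):  # + 1 so the last code point is covered
    c = unichr(i)
    if c.isupper():
        uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)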
Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
To save startup cost, consider pickling the compiled expression.
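
A minimal sketch of the pickling round trip, assuming Python 3 (where chr
replaces unichr and every str is Unicode). The helper name
build_upper_class and the sample text are made up for illustration. As far
as I know, the re module pickles a compiled pattern by storing its pattern
string and flags and recompiling on load, so the saving is mainly the scan
over all code points rather than the compile step itself; measure before
relying on it.

```python
import pickle
import re
import sys


def build_upper_class():
    """Return a pattern string for a class of all uppercase code points."""
    chars = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]
    return "[" + "".join(chars) + "]"


uppers_re = re.compile(build_upper_class())

# Round-trip through pickle, as one would do via a cache file on disk.
blob = pickle.dumps(uppers_re)
restored = pickle.loads(blob)

# Sample text: "Ärger und Ωmega", written with escapes to avoid encoding issues.
print(restored.findall("\u00c4rger und \u03a9mega"))
```

In a real program the blob would be written to a file once and loaded at
startup instead of calling build_upper_class again.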
(syntax note: this only works because none of the characters special
to a RE class (]-^\) is an uppercase letter; otherwise, escaping might
be needed)
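
If the set of characters could contain one of those special characters,
escaping each member with re.escape before joining is one way to stay
safe. The following is only a sketch, assuming Python 3; char_class is a
hypothetical helper name, not something from the re module.

```python
import re


def char_class(chars):
    """Build a regex character class, escaping every member individually."""
    return "[" + "".join(re.escape(c) for c in chars) + "]"


# A class made of exactly the four characters that are special inside [...].
specials_re = re.compile(char_class("]-^\\"))

print(specials_re.findall("a]b-c^d\\e"))
```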
> for the latter two, to work on utf-8 strings, would I have to set the
> defaultencoding to utf-8?
For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.
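
As an illustration of that advice, the sketch below decodes UTF-8 bytes to
a Unicode string before matching. It assumes Python 3 and a made-up sample
input; the use of \w with re.UNICODE is my choice for the example, not
something taken from the original question.

```python
import re

# Hypothetical UTF-8 input: "Ärger in Österreich" encoded as bytes.
raw = "\u00c4rger in \u00d6sterreich".encode("utf-8")

# Decode first, then match on the Unicode string.
text = raw.decode("utf-8")
word_re = re.compile(r"\w+", re.UNICODE)
unicode_words = word_re.findall(text)

# Matching on the raw bytes treats each byte separately, so the
# multi-byte letters are not recognised as word characters.
byte_words = re.compile(br"\w+").findall(raw)

print(unicode_words)
print(byte_words)
```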
Regards,
Martin