regexps with unicode-aware characterclasses?

Wed Sep 14 02:09:03 EDT 2005

Stefan Rank wrote:
> <wishful thinking>
> 
>   re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

This would (almost) work, but it would be terribly inefficient (time
linear to the number of alternatives). You can realistically do

uppers = [u'[']
for i in range(sys.maxunicode):
  c = unichr(i)
  if c.isupper(): uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)

Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
To save startup cost, consider pickling the compiled expression.

(syntax note: this only works because none of the characters special
to a RE class (]-^\) is an uppercase letter; otherwise, escaping might
be needed)

> for the latter two, to work on utf-8 strings, would I have to set the
> defaultencoding to utf-8?

For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.

Regards,
Martin