unicode categories -- regex

Sat Sep 22 13:29:26 EDT 2007

> So how do i include this information in regular pattern search? Any
> ideas?

At the moment, you have to generate a character class for this yourself,
e.g.

py> import re, sys, unicodedata
py> chars = [unichr(i) for i in range(sys.maxunicode + 1)]
py> chars = [c for c in chars if unicodedata.category(c)=='Po']
py> expr = u'[\\' + u'\\'.join(chars)+"]"
py> expr = re.compile(expr)
py> expr.match(u"#")
<_sre.SRE_Match object at 0xb7ce1d40>
py> expr.match(u"a")
py> expr.match(u"\u05be")
<_sre.SRE_Match object at 0xb7ce1d78>

Creating this expression is fairly expensive. Once compiled, however,
it has a compact representation in memory, and matching it is
efficient.
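
Since the construction is the costly part, it may be worth wrapping it
in a helper that builds each class only once. The following is a
minimal sketch, not a definitive implementation. It assumes Python 3
(where chr replaces unichr), the name category_pattern is hypothetical,
re.escape is swapped in for the manual backslash prefix, and
functools.lru_cache is my choice for the caching:

```python
import functools
import re
import sys
import unicodedata


@functools.lru_cache(maxsize=None)
def category_pattern(category):
    """Compile a one-character class for a Unicode general category.

    Hypothetical helper: scans every code point once per category and
    caches the compiled pattern, so later calls are cheap.
    """
    chars = [chr(i) for i in range(sys.maxunicode + 1)
             if unicodedata.category(chr(i)) == category]
    # re.escape protects characters that are special inside [...],
    # such as backslash, ']', '^' and '-'.
    return re.compile('[' + ''.join(re.escape(c) for c in chars) + ']')


po = category_pattern('Po')
print(po.match('#'))   # '#' (NUMBER SIGN) is in category Po
print(po.match('a'))   # 'a' is in category Ll, not Po
print(category_pattern('Po') is po)  # second call comes from the cache
```

Which characters fall into a category depends on the Unicode database
shipped with the interpreter, so results for individual characters can
differ between Python versions.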

Contributions to support categories directly in re are welcome. Look
at the relevant Unicode recommendation on how to do that.

HTH,
Martin