Enabling the use of POSIX character classes in Python

Martin v. Loewis martin at v.loewis.de
Sat Dec 11 12:55:42 EST 2010


On 11.12.2010 18:33, Perry Johnson wrote:
> Python's re module does not support POSIX character classes, for
> example [:alpha:]. It is, of course, trivial to simulate them using
> character ranges when the text to be matched uses the ASCII character
> set. Sadly, my problem is that I need to process Unicode text. The re
> module has its own character classes that do support Unicode, however
> they are not sufficient.
> 
> I would find it extremely useful if there was information on the
> Unicode code points that map to each of the POSIX character classes.

By definition, such a mapping is not possible. The POSIX character
classes are locale-dependent, whereas the recommendation for Unicode
regular expressions is that they are not (i.e. a Unicode regex character
class should refer to the same characters independent of the locale).

If you want to construct locale-dependent Unicode character classes,
you should use this procedure:
- iterate over all byte values (0..255)
- apply the relevant locale-specific test (e.g. isalpha) to each byte
- decode each byte that passes into Unicode, using the locale's encoding
- construct a character class out of the resulting characters

Unfortunately, that will work only for single-byte encodings.
I'm not aware of a procedure that does that for multi-byte encodings.
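
The four steps above could be sketched roughly as follows. This is only
an illustration under assumptions: build_char_class and is_member are
hypothetical names, and the locale-specific test is passed in as a
callable because bytes.isalpha() in Python 3 only knows ASCII. A real
locale-dependent test would have to come from elsewhere (for example the
C library's isalpha() after locale.setlocale()), which is not shown here.

```python
import re


def build_char_class(is_member, encoding):
    """Build a regex character class from a per-byte membership test.

    is_member -- hypothetical callable: takes a byte value (0..255) and
                 returns true if the locale puts that byte in the class
    encoding  -- name of the locale's encoding (single-byte only)
    """
    chars = []
    for value in range(256):                  # step 1: all byte values
        if not is_member(value):              # step 2: locale-specific test
            continue
        try:
            # step 3: decode the single byte with the locale's encoding
            chars.append(bytes([value]).decode(encoding))
        except UnicodeDecodeError:
            continue                          # byte unassigned in this encoding
    if not chars:
        raise ValueError("no byte passed the test; '[]' is not a valid class")
    # step 4: escape each character so ']', '^', '-' and '\' stay literal
    return "[" + "".join(re.escape(c) for c in chars) + "]"


# Stand-in for a real locale test: accept only three ISO-8859-1 umlaut bytes.
umlaut_bytes = {0xE4, 0xF6, 0xFC}
pattern = re.compile(build_char_class(lambda b: b in umlaut_bytes, "iso-8859-1"))
```

In practice the encoding name would probably come from something like
locale.getpreferredencoding(). With a multi-byte encoding such as UTF-8,
the lone high bytes fail to decode and are skipped, so only the
single-byte subset ends up in the class, which is the limitation
mentioned above.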

But perhaps you didn't mean "POSIX character class" in this literal
way.

Regards,
Martin



