[regex] case-splitting strings in unicode

Sun Oct 9 12:36:02 EDT 2005

John Perks and Sarah Mount wrote:
> I have to split some identifiers that are casedLikeThis into their
> component words. In this instance I can safely use [A-Z] to represent
> uppercase, but what pattern should I use if I wanted it to work more
> generally? I can envisage walking the string testing the
> unicodedata.category of each char, but is there a regex'y way to denote
> "uppercase"?

The re module does not currently implement such a class, although it
should (written as [[:upper:]], I believe). Contributions are welcome;
make sure you read the Unicode Consortium's guidelines on regular
expressions before attempting to implement it.

Until then, the "best" way is to use an ordinary character class,
either precomputed or built at runtime:

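import sys
# Every code point whose character Python classifies as uppercase.
uni_upper = [unichr(i) for i in range(sys.maxunicode + 1)
             if unichr(i).isupper()]
# A character class matching any single uppercase character.
uni_re = u"[" + u"".join(uni_upper) + u"]"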
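
To tie this back to the original question, here is a minimal sketch of
how such a class might be used to split an identifier. It is written
for Python 3, where chr replaces unichr. The names split_cased and
word_re are made up for illustration, and the rule "a word is one
uppercase character plus the non-uppercase characters that follow it"
is an assumption about what the split should do, not something stated
in the question.

```python
import re
import sys

# Same construction as the character class above, spelled for Python 3.
uni_upper = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]
uni_re = "[" + "".join(uni_upper) + "]"
# Negated class: any single character that is not uppercase.
not_upper = "[^" + "".join(uni_upper) + "]"

# A word is either an uppercase character followed by any number of
# non-uppercase characters, or a leading run of non-uppercase characters.
word_re = re.compile(uni_re + not_upper + "*|" + not_upper + "+")


def split_cased(identifier):
    """Split an identifier such as casedLikeThis into its words."""
    return word_re.findall(identifier)


print(split_cased("casedLikeThis"))  # ['cased', 'Like', 'This']
```

Under this rule a run of capitals, as in HTTPServer, comes out one
letter at a time; the pattern would need adjusting if acronyms should
stay together.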

On my machine, building uni_re this way takes approximately one
second, which may or may not be acceptable as a startup cost. To
avoid paying it on every run, you could dump the resulting uni_re
into a Python source file and import it from there.
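
Dumping the class into a source file could look roughly like the
sketch below (Python 3 spelling, with chr in place of unichr). The
module name uni_upper_class.py and the three helper names are
hypothetical. ascii() is used so that the generated file contains
only backslash escapes and does not depend on a source encoding.

```python
import sys


def build_upper_class():
    """Return a character class matching any single uppercase character."""
    uni_upper = [chr(i) for i in range(sys.maxunicode + 1)
                 if chr(i).isupper()]
    return "[" + "".join(uni_upper) + "]"


def render_module_source(uni_re):
    """Return Python source text that defines uni_re as a string literal."""
    # ascii() escapes every non-ASCII character, so the text is plain ASCII.
    return ("# Generated file, do not edit by hand.\n"
            "uni_re = " + ascii(uni_re) + "\n")


def write_module(path="uni_upper_class.py"):
    """Write the precomputed class to a (hypothetical) module file."""
    with open(path, "w", encoding="ascii") as f:
        f.write(render_module_source(build_upper_class()))
```

After write_module() has been run once, and provided the generated
file is on the import path, "from uni_upper_class import uni_re"
replaces the scan over all code points at startup.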

Regards,
Martin