[regex] case-splitting strings in unicode

Sat Oct 8 21:25:16 EDT 2005

On Oct 09, John Perks and Sarah Mount wrote:
> I have to split some identifiers that are casedLikeThis into their
> component words. In this instance I can safely use [A-Z] to represent
> uppercase, but what pattern should I use if I wanted it to work more
> generally? I can envisage walking the string testing the
> unicodedata.category of each char, but is there a regex'y way to
> denote "uppercase"?

Not sure what your output should look like but something like this could
work:

>>> import re
>>> re.sub(r'([A-Z])', r' \1', 'theFirstTest theSecondTest')
'the First Test the Second Test'

This can be adapted for multiline, etc, but maybe '[A-Z]' is
sufficiently general.  The regex module does have an understanding of
unicode (but I don't, sorry); you could add (?u) make it unicode aware.
For programming language identifiers I wouldn't think that unicode
should be an issue.  Sorry I'm no help with unicode specifics.

Some useful links:

http://www.python.org/doc/2.4.2/lib/module-re.html
http://www.amk.ca/python/howto/regex/regex.html

-- 
Micah Elliott
<mde at micah.elliott.name>