[regex] case-splitting strings in unicode
Micah Elliott
mde at micah.elliott.name
Sat Oct 8 21:25:16 EDT 2005
On Oct 09, John Perks and Sarah Mount wrote:
> I have to split some identifiers that are casedLikeThis into their
> component words. In this instance I can safely use [A-Z] to represent
> uppercase, but what pattern should I use if I wanted it to work more
> generally? I can envisage walking the string testing the
> unicodedata.category of each char, but is there a regex'y way to
> denote "uppercase"?
Not sure what your output should look like but something like this could
work:
>>> import re
>>> re.sub(r'([A-Z])', r' \1', 'theFirstTest theSecondTest')
'the First Test the Second Test'
This can be adapted for multiline, etc, but maybe '[A-Z]' is
sufficiently general. The regex module does have an understanding of
unicode (but I don't, sorry); you could add (?u) make it unicode aware.
For programming language identifiers I wouldn't think that unicode
should be an issue. Sorry I'm no help with unicode specifics.
Some useful links:
http://www.python.org/doc/2.4.2/lib/module-re.html
http://www.amk.ca/python/howto/regex/regex.html
--
Micah Elliott
<mde at micah.elliott.name>
More information about the Python-list
mailing list