Finding Upper-case characters in regexps, unicode friendly.

Wed May 24 16:36:24 EDT 2006

> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as
> possible, focusing on european languages first, hence it'd be
> useful to be able to refer to any uppercase unicode character
> instead of just the typical [A-Z], which doesn't include, for
> example É.   Is there a way to do this, or do I have to stick
> with using the isupper method of the string class?

Well, assuming you pass in the UNICODE or LOCALE specifier, the 
following portion of a regexp *should* find what you're describing:

###############################################
import re
tests = [("1", False),
	("a", True),
	("Hello", True),
	("2bad", False),
	("bad1", False),
	("a c", False)
	]
r = re.compile(r'^(?:(?=\w)[^\d_])*$')
for test, expected_result in tests:
     if r.match(test):
         passed = expected_result
     else:
         passed = not expected_result
     print "[%s] expected [%s] passed [%s]" % (
         test, expected_result, passed)
###############################################

That looks for a "word" character ("\w") but doesn't swallow it 
("(?=...)"), and then asserts that the character is not ("^") a 
digit ("\d") or an underscore.  It looks for any number of "these 
things" ("(?:...)*"), which you can tweak to your own taste.

For Unicode-ification, just pass the re.UNICODE parameter to 
compile().

Hope this makes sense and helps,

-tkc