Finding Upper-case characters in regexps, unicode friendly.

John Machin sjmachin at lexicon.net
Wed May 24 18:49:01 EDT 2006


On 25/05/2006 5:43 AM, possibilitybox at gmail.com wrote:
> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as possible, focusing
> on european languages first, hence it'd be useful to be able to refer
> to any uppercase unicode character instead of just the typical [A-Z],
> which doesn't include, for example É.   Is there a way to do this, or
> do I have to stick with using the isupper method of the string class?
> 

You have set yourself a rather daunting task.

:-)
i am here to tell you grandfather that these days we write without 
accents without capitals without punctuation without anything long live 
sms long live the revolution teachers to the lamp post ah s*** not lamp posts
(-:

I would have thought that a full-on NLP parser might be required, even 
for more-or-less-conventionally-expressed utterances. How will you 
handle "It's not elementary, Dr. Watson."?

However if you persist: there appears to be no way of specifying "an 
uppercase character" in Python's re module, which has no Unicode property 
classes such as \p{Lu}. You are stuck with isupper().
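
One possible workaround, sketched here rather than offered as the definitive 
answer: let isupper() enumerate the uppercase characters once, then paste them 
into an ordinary character class for re. The sketch assumes Python 3 (chr 
instead of unichr, str patterns that are Unicode by default); the names 
upper_chars, UPPER and capitalised_word are made up for illustration.

```python
import re
import sys

# Collect every code point that str.isupper() reports as uppercase.
# Assumption: Python 3, where chr() covers the whole Unicode range.
upper_chars = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]

# Build one big character class; re.escape is only a safety net here.
UPPER = "[" + "".join(re.escape(c) for c in upper_chars) + "]"

# Hypothetical pattern: an uppercase character followed by word characters.
capitalised_word = re.compile(UPPER + r"\w*")

print(capitalised_word.findall("Émile et Zoé vont à Östersund."))
# → ['Émile', 'Zoé', 'Östersund']
```

The class is large (well over a thousand characters), but it is built only 
once, and it tracks whatever Unicode database the running interpreter ships.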

Light entertainment for the speed-freaks:
 >>> ucucase = set(unichr(i) for i in range(65536) if unichr(i).isupper())
 >>> len(ucucase)
704

Is foo in ucucase faster than foo.isupper()?
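
For anyone who wants to try it, a rough way to measure is timeit. This is a 
sketch assuming Python 3 (chr rather than unichr); foo, t_set and t_method are 
illustrative names, and the set size will generally differ from the 704 above 
because it depends on the interpreter's Unicode database.

```python
import timeit

# Same construction as above, restricted to the first 65536 code points.
# Assumption: Python 3, so chr() replaces unichr().
ucucase = frozenset(chr(i) for i in range(65536) if chr(i).isupper())

foo = "\u00c9"  # hypothetical test character: É

# Time both checks; the lambda call overhead is included in each figure,
# so only the difference between the two numbers is meaningful.
t_set = timeit.timeit(lambda: foo in ucucase, number=200000)
t_method = timeit.timeit(lambda: foo.isupper(), number=200000)

print("foo in ucucase: %.3f s" % t_set)
print("foo.isupper():  %.3f s" % t_method)
```

The numbers vary by machine and Python version, so no result is quoted here.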

Cheers,
John





