Finding Upper-case characters in regexps, unicode friendly.
John Machin
sjmachin at lexicon.net
Wed May 24 18:49:01 EDT 2006
On 25/05/2006 5:43 AM, possibilitybox at gmail.com wrote:
> I'm trying to make a unicode friendly regexp to grab sentences
> reasonably reliably for as many unicode languages as possible, focusing
> on european languages first, hence it'd be useful to be able to refer
> to any uppercase unicode character instead of just the typical [A-Z],
> which doesn't include, for example É. Is there a way to do this, or
> do I have to stick with using the isupper method of the string class?
>
You have set yourself a rather daunting task.
:-)
je suis ici a vous dire grandpere que maintenant nous ecrivons sans
accents sans majuscules sans ponctuation sans tout vive le sms vive la
revolution les professeurs a la lanterne ah m**** pas des lanternes
(-:
I would have thought that a full-on NLP parser might be required, even
for more-or-less-conventionally-expressed utterances. How will you
handle "It's not elementary, Dr. Watson."?
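To make the "Dr. Watson" problem concrete, here is a minimal sketch of the kind of naive pattern one might start with. The name naive_sentence and the pattern itself are illustrative assumptions, not anything from the original question; the point is only that a full stop after an abbreviation is indistinguishable from one that ends a sentence.

```python
import re

# Hypothetical naive pattern: an ASCII capital, then anything that is not
# sentence-final punctuation, then one sentence-final punctuation mark.
naive_sentence = re.compile(r"[A-Z][^.!?]*[.!?]")

pieces = naive_sentence.findall("It's not elementary, Dr. Watson.")
print(pieces)
# The abbreviation "Dr." is treated as the end of a sentence,
# so the single sentence comes back as two pieces.
```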
However, if you persist: there appears to be no way of specifying "an
uppercase character" in Python's re module. You are stuck with isupper().
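One possible workaround, sketched here under assumptions, is to let isupper() do the classification once and then paste the resulting characters into an ordinary character class. This uses Python 3 spelling (chr rather than unichr), covers only the Basic Multilingual Plane, and the names UPPER_CLASS and capitalised_word are made up for illustration.

```python
import re

# Collect every BMP character that str.isupper() reports as upper case.
# (Python 3 spelling; older Pythons would use unichr() instead of chr().)
upper_chars = [chr(i) for i in range(0x10000) if chr(i).isupper()]

# Build one big character class; re.escape() is a precaution in case any
# collected character were special inside a class.
UPPER_CLASS = "[" + "".join(re.escape(c) for c in upper_chars) + "]"

# Illustrative use: words that start with an upper-case letter.
capitalised_word = re.compile(UPPER_CLASS + r"\w*")

print(capitalised_word.findall("Émile et Zoé vont à Paris"))
```

The class is large, but it is built only once, and it matches É where [A-Z] does not.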
Light entertainment for the speed-freaks:
>>> ucucase = set(unichr(i) for i in range(65536) if unichr(i).isupper())
>>> len(ucucase)
704
Is foo in ucucase faster than foo.isupper()?
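For anyone who wants to answer that for themselves, a rough timing sketch with the standard timeit module follows. It assumes Python 3 (chr instead of unichr), picks "É" as an arbitrary test character, and the repetition count is arbitrary too; results will vary by machine and interpreter version.

```python
import timeit

# Rebuild the set from the session above, using Python 3's chr().
ucucase = frozenset(chr(i) for i in range(0x10000) if chr(i).isupper())
foo = "É"  # arbitrary test character

# Time each approach over the same number of repetitions.
t_set = timeit.timeit("foo in ucucase",
                      globals={"foo": foo, "ucucase": ucucase},
                      number=200000)
t_method = timeit.timeit("foo.isupper()",
                         globals={"foo": foo},
                         number=200000)

print("foo in ucucase: %.4f s" % t_set)
print("foo.isupper():  %.4f s" % t_method)
```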
Cheers,
John