regular expressions and internationalization (WAS: permuting letters...)
Andrew Dalke
adalke at mindspring.com
Fri Nov 12 18:50:04 EST 2004
Steven Bethard wrote:
> I looked again at the re module, and it seems that \w and \W do have
> internationalization support... Is there any way to match \w but not \d? Maybe
> something like:
> r'[^\d\W]{4,}'
>
> This seems to work (maybe?):
Yeah, I tried \w originally but noticed that it also matches
digits and '_', which ruined the idea of the scramble. So I
opted for the OP's choice and hard-coded just the letters.
> >>> p = re.compile(r'[^\d\W]{4,}', re.UNICODE)
Nice way to do it. I would also put '_' in that exclude list.
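
A minimal sketch of the same pattern with '_' added to the exclude
list. It assumes Python 3, where str patterns are Unicode-aware by
default, so re.UNICODE is not needed. The helper name letter_runs is
made up for illustration.

```python
import re

# [^\W\d_] reads as "word characters, minus digits, minus underscore",
# which leaves only letters. {4,} keeps runs of four or more.
LETTERS = re.compile(r"[^\W\d_]{4,}")

def letter_runs(text):
    """Return every run of four or more letters found in text."""
    return LETTERS.findall(text)

print(letter_runs("G\u00f6teborg foo_bar1234 spam_eggs abc"))
```

With the underscore excluded, "spam_eggs" splits into two separate
words, and "foo_bar1234" yields nothing because neither "foo" nor
"bar" reaches four letters.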
> I don't know how to check how this works in different
> locales though...
I don't think I understand locales well enough either. It
looks like I need to use re.UNICODE more often.
>>> import string
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> import re
>>> pat = re.compile(r"[^\d^\W]{4,}", re.UNICODE)
>>> GOT = u"G\N{LATIN SMALL LETTER O WITH DIAERESIS}teborg"
>>> print GOT.encode("utf8")
Göteborg
>>> pat.search(GOT).group(0)
u'G\xf6teborg'
>>> pat = re.compile(r"[^\d^\W]{4,}")
>>> pat.search(GOT).group(0)
u'teborg'
>>>
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'C'
>>>
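So you can see that re.UNICODE uses the Unicode definition
of what a letter is, even though the locale is C. Without the
flag, \w does not match the ö, so the search skips the lone
'G' and returns only u'teborg'.
Therefore, I think your approach is better.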
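
For anyone trying this on Python 3, here is a rough equivalent of the
comparison. It assumes Python 3 semantics, where the defaults are
flipped: str patterns use the Unicode definitions unless re.ASCII is
passed. The variable names are illustrative only.

```python
import re

GOT = "G\N{LATIN SMALL LETTER O WITH DIAERESIS}teborg"

# Unicode-aware by default in Python 3 (same effect as re.UNICODE above).
unicode_pat = re.compile(r"[^\d\W]{4,}")
# re.ASCII restricts \w and \d to ASCII, like the flagless Python 2 pattern.
ascii_pat = re.compile(r"[^\d\W]{4,}", re.ASCII)

unicode_match = unicode_pat.search(GOT).group(0)
ascii_match = ascii_pat.search(GOT).group(0)

print(unicode_match)  # the whole word, including the non-ASCII letter
print(ascii_match)    # only the ASCII tail after the non-ASCII letter
```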
Andrew
dalke at dalkescientific.com