regular expressions and internationalization (WAS: permuting letters...)

Fri Nov 12 18:50:04 EST 2004

Steven Bethard wrote:
> I looked again at the re module, and it seems that \w and \W do have
> internationalization support... Is there any way to match \w but not \d?  Maybe
> something like:
>     r'[^\d\W]{4,}'
> 
> This seems to work (maybe?):

Yeah, I tried \w originally but noticed that it also matches
digits and '_', which ruined the idea of the scramble.  So I
went with the OP's choice and hard-coded just the letters.

> >>> p = re.compile(r'[^\d\W]{4,}', re.UNICODE)

Nice way to do it.  I would also put '_' in that exclude list.
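
A minimal sketch of what excluding '_' changes, assuming
Python 3 (where str patterns are Unicode-aware by default, so
re.UNICODE is implied).  The sample text is made up for
illustration and is not from the thread:

```python
import re

# Hypothetical sample text, not from the thread.
text = "G\u00f6teborg has_under 12345 abc123def word"

# \w alone also accepts digits and '_'.
with_w = re.findall(r"\w{4,}", text)

# [^\W\d] drops digits but still lets '_' through.
no_digits = re.findall(r"[^\W\d]{4,}", text)

# [^\W\d_] drops digits and '_', leaving letters only.
letters_only = re.findall(r"[^\W\d_]{4,}", text)

print(with_w)
print(no_digits)
print(letters_only)
```

With the underscore excluded, "has_under" is split at the '_'
and only the run of four or more letters survives.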

 > I don't know how to check how this works in different
 > locales though...

I don't think I understand locales well enough either.  It
looks like I need to use re.UNICODE more often.

 >>> import string
 >>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
 >>> import re
 >>> pat = re.compile(r"[^\d\W]{4,}", re.UNICODE)
 >>> GOT = u"G\N{LATIN SMALL LETTER O WITH DIAERESIS}teborg"
 >>> print GOT.encode("utf8")
Göteborg
 >>> pat.search(GOT).group(0)
u'G\xf6teborg'
 >>> pat = re.compile(r"[^\d\W]{4,}")
 >>> pat.search(GOT).group(0)
u'teborg'
 >>>
 >>> import locale
 >>> locale.setlocale(locale.LC_ALL, "")
'C'
 >>>

So you can see that with re.UNICODE, \w follows the Unicode
definition of a letter even though the locale is C.
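
For comparison, here is a rough sketch of the same contrast,
assuming Python 3.  There str patterns use the Unicode rules by
default, and the re.ASCII flag is what restricts \w, \W and \d to
ASCII, giving a result like the flagless run in the session
above.  The variable names are mine, for illustration only:

```python
import re

got = "G\N{LATIN SMALL LETTER O WITH DIAERESIS}teborg"

# Default for str patterns in Python 3: Unicode-aware \w, \W, \d.
unicode_match = re.search(r"[^\d\W]{4,}", got).group(0)

# re.ASCII limits the classes to ASCII, so the o-with-diaeresis
# counts as a non-word character and breaks the match.
ascii_match = re.search(r"[^\d\W]{4,}", got, re.ASCII).group(0)

print(unicode_match)
print(ascii_match)
```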

Therefore, I think your approach is better.

				Andrew
				dalke at dalkescientific.com