Unicode regex and Hindi language

Sat Nov 29 18:13:40 EST 2008

John Machin wrote:

John, nothing I wrote was directed at you.  If you feel insulted, you 
have my apology.  My intention was and is to get movement on an issue 
that was reported 20 months ago and then lay dead until it was 
re-reported (a bit more clearly) a week ago.  It stalled because of a 
misunderstanding by the person who (I believe) rewrote re for unicode 
several years ago.

> Like this:
> 
> | >>> w1 = u"L\N{LATIN SMALL LETTER O WITH DIAERESIS}wis"
> | >>> w2 = u"Lo\N{COMBINING DIAERESIS}wis"
> | >>> w1
> | u'L\xf6wis'
> | >>> w2
> | u'Lo\u0308wis'
> | >>> import unicodedata as ucd
> | >>> ucd.category(u'\u0308')
> | 'Mn'
> | >>> u'\u0308'.isalpha()
> | False
> | >>> import re
> | >>> regex = re.compile(ur'\w+', re.UNICODE)
> | >>> regex.match(w1).group(0)
> | u'L\xf6wis'
> | >>> regex.match(w2).group(0)
> | u'Lo'

Yes, thank you.  FWIW, that confirms my suspicion.
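
For anyone who needs a stopgap, here is a minimal sketch of two possible 
workarounds.  It assumes Python 3 str semantics rather than the u'' 
literals of the session quoted above, and leading_word is a hypothetical 
helper of my own naming, not anything in re.  NFC normalization only 
helps where a precomposed character exists, so it rescues the Löwis 
example but not Devanagari vowel signs; the helper instead accepts any 
character whose general category is a letter, mark, or number.

```python
import re
import unicodedata


def leading_word(text):
    """Return the leading run of letters, marks, numbers, and underscores.

    Hypothetical helper for illustration: it treats combining marks
    (categories Mn, Mc, Me) as part of a word, which \\w does not.
    """
    end = 0
    for ch in text:
        if ch == "_" or unicodedata.category(ch)[0] in "LMN":
            end += 1
        else:
            break
    return text[:end]


w2 = "Lo\N{COMBINING DIAERESIS}wis"

# Plain \w+ stops in front of the combining mark, as in the quoted session.
plain = re.match(r"\w+", w2).group(0)

# Workaround 1: compose first, so that o + U+0308 becomes a single letter.
composed = re.match(r"\w+", unicodedata.normalize("NFC", w2)).group(0)

# Workaround 2: classify by general category and keep the marks.
with_marks = leading_word(w2)

# Devanagari "Hindi": letters interleaved with vowel signs and a virama,
# none of which have precomposed forms for NFC to fall back on.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
hindi_word = leading_word(hindi)
```

Neither approach is a substitute for fixing \w itself; the second one 
also gives up everything else a regex would do for you.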

Terry