Unicode regex and Hindi language

Terry Reedy tjreedy at udel.edu
Sat Nov 29 18:13:40 EST 2008


John Machin wrote:

John, nothing I wrote was directed at you.  If you feel insulted, you 
have my apology.  My intention was and is to get movement on an issue 
that was reported 20 months ago and then lay dormant until it was 
re-reported (a bit more clearly) a week ago.  It stalled because of a 
misunderstanding by the person who (I believe) rewrote re for unicode 
several years ago.

> Like this:
> 
> | >>> w1 = u"L\N{LATIN SMALL LETTER O WITH DIAERESIS}wis"
> | >>> w2 = u"Lo\N{COMBINING DIAERESIS}wis"
> | >>> w1
> | u'L\xf6wis'
> | >>> w2
> | u'Lo\u0308wis'
> | >>> import unicodedata as ucd
> | >>> ucd.category(u'\u0308')
> | 'Mn'
> | >>> u'\u0308'.isalpha()
> | False
> | >>> import re
> | >>> regex = re.compile(ur'\w+', re.UNICODE)
> | >>> regex.match(w1).group(0)
> | u'L\xf6wis'
> | >>> regex.match(w2).group(0)
> | u'Lo'

Yes, thank you.  FWIW, that confirms my suspicion.
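
For anyone who needs whole words in the meantime, here is a minimal 
sketch of a workaround.  It is written in Python 3 syntax (so no u'' 
or ur'' prefixes, unlike the session quoted above), and the helper 
names is_mark and words are made up for illustration; this is not the 
fix proposed for re itself.  The assumption is that a combining mark 
(category Mn or Mc) belongs to the word whenever it directly follows a 
word character.

```python
import re
import unicodedata


def is_mark(ch):
    """True for nonspacing (Mn) and spacing combining (Mc) marks."""
    return unicodedata.category(ch) in ("Mn", "Mc")


def words(text):
    """Split text into words, keeping combining marks attached.

    A character extends the current word if it is alphanumeric, an
    underscore, or a combining mark that follows a word character.
    """
    result, current = [], []
    for ch in text:
        if ch.isalnum() or ch == "_" or (current and is_mark(ch)):
            current.append(ch)
        elif current:
            result.append("".join(current))
            current = []
    if current:
        result.append("".join(current))
    return result


w1 = "L\N{LATIN SMALL LETTER O WITH DIAERESIS}wis"
w2 = "Lo\N{COMBINING DIAERESIS}wis"
# The word "Hindi" in Devanagari: HA, VOWEL SIGN I, NA, VIRAMA, DA,
# VOWEL SIGN II.  The vowel signs and the virama are combining marks.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

plain = re.compile(r"\w+")

# \w+ stops at the first combining mark, as in the quoted session.
print(ascii(plain.match(w2).group(0)))
print(ascii(plain.match(hindi).group(0)))

# The helper keeps each word in one piece.
print([ascii(w) for w in words(w2)])
print([ascii(w) for w in words(hindi)])

# NFC normalization rescues w2, because a precomposed o-with-diaeresis
# exists, but it cannot help Devanagari, which has no precomposed forms
# for these sequences.
print(unicodedata.normalize("NFC", w2) == w1)
```

Normalizing to NFC before matching is the cheaper trick when the text 
is Latin-script only; the character-by-character loop is what you need 
for Hindi.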

Terry



