Unicode regex and Hindi language

Fri Nov 28 16:15:14 EST 2008

MRAB wrote:

> Should the Mc and Mn codepoints match \w in the re module even though 
> u'हिन्दी'.isalpha() returns False (in Python 2.x, haven't tried Python 
> 3.x)? 

Same.  And to me, that is wrong.  The condensation of vowel characters 
(which Hindi, etc., also have for words that begin with vowels) to 'vowel 
marks' attached to the previous consonant does not change their nature as 
indications of speech sounds.  The difference is purely graphical.
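
To make this concrete, here is a minimal sketch (Python 3 syntax assumed; 
the variable names are mine, for illustration only) that lists the general 
category of each code point in the word and shows what .isalpha() makes 
of it:

```python
import unicodedata

# The word from the quoted example, spelled with escapes so the order of
# the combining marks is unambiguous: ha, vowel sign i, na, virama, da,
# vowel sign ii.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# General category of each code point, in order.
categories = [unicodedata.category(ch) for ch in word]
print(categories)

# str.isalpha() only accepts the letter categories, so the Mc and Mn
# code points make the whole word fail the test.
print(word.isalpha())
```

The three consonants come out as Lo, while the two vowel signs are Mc and 
the virama is Mn, which is why the word as a whole is not 'alpha'.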

> Issue 1693050 said no.
The full URL
http://bugs.python.org/issue1693050
would have been nice, but thank you for finding this.  I searched, but 
obviously not with the right words.  In any case, this issue is still 
open.  MAL is wrong about at least Mc and Mn.  I will explain there also.

> Perhaps someone with knowledge of Hindi
> could suggest how Python should handle it.

Recognize that vowel marks are parts of words, as it already does for identifiers.
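
A rough sketch of what I mean, assuming Python 3.  The is_word_char 
helper is hypothetical, just one way a 'word character' test could treat 
Mn and Mc; it is not what the re module does:

```python
import unicodedata

word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the Hindi word from above

# The identifier rules already accept combining marks after the first
# character, so the whole word is a valid identifier.
print(word.isidentifier())

def is_word_char(ch):
    """Hypothetical predicate: alphanumerics, underscore, plus Mn/Mc marks."""
    return (
        ch == "_"
        or ch.isalnum()
        or unicodedata.category(ch) in ("Mn", "Mc")
    )

# With marks counted as word characters, every code point qualifies.
print(all(is_word_char(ch) for ch in word))
```

With a test like this, the word would be matched whole instead of being 
split at each vowel sign.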

> I wouldn't want the re module 
> to say one thing and the rest of the language to say another! :-)

I will add a note about .isalpha.

Terry Jan Reedy