Unicode regex and Hindi language

Fri Nov 28 14:41:19 EST 2008

Terry Reedy wrote:
> Jerry Hill wrote:
>> On Fri, Nov 28, 2008 at 10:47 AM, Shiao <multiseed at gmail.com> wrote:
>>> The regex below identifies words in all languages I tested, but not in
>>> Hindi:
>>>
>>> # -*- coding: utf-8 -*-
>>>
>>> import re
>>> pat = re.compile('^(\w+)$', re.U)
>>> langs = ('English', '中文', 'हिन्दी')
>>
>> I think the problem is that the Hindi text contains both alphanumeric
>> and non-alphanumeric characters.  I'm not very familiar with Hindi,
>> much less how it's represented in Unicode, but take a look at the
>> output of this code:
>>
>> # -*- coding: utf-8 -*-
>> import unicodedata as ucd
>>
>> langs = (u'English', u'中文', u'हिन्दी')
>> for lang in langs:
>>     print lang
>>     for char in lang:
>>         print "\t %s %s (%s)" % (char, ucd.name(char),
>>                                  ucd.category(char))
>>
>> Output:
>>
>> English
>>      E LATIN CAPITAL LETTER E (Lu)
>>      n LATIN SMALL LETTER N (Ll)
>>      g LATIN SMALL LETTER G (Ll)
>>      l LATIN SMALL LETTER L (Ll)
>>      i LATIN SMALL LETTER I (Ll)
>>      s LATIN SMALL LETTER S (Ll)
>>      h LATIN SMALL LETTER H (Ll)
>> 中文
>>      中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
>>      文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
>> हिन्दी
>>      ह DEVANAGARI LETTER HA (Lo)
>>      ि DEVANAGARI VOWEL SIGN I (Mc)
>>      न DEVANAGARI LETTER NA (Lo)
>>      ् DEVANAGARI SIGN VIRAMA (Mn)
>>      द DEVANAGARI LETTER DA (Lo)
>>      ी DEVANAGARI VOWEL SIGN II (Mc)
>>
>> From that, we see that there are some characters in the Hindi string
>> that aren't letters (they're not in Unicode category L), but are
>> instead marks (Unicode category M).
> 
> Python 3.0 allows Unicode identifiers.  Mn and Mc characters are included
> in the set of allowed alphanumeric characters.  'Hindi' is a word in
> both its native characters and in Latin transliteration.
> 
> http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords
>
> re is too restrictive in its definition of 'word'. I suggest that OP 
> (original poster) Shiao file a bug report at http://bugs.python.org
> 
Should the Mc and Mn codepoints match \w in the re module even though 
u'हिन्दी'.isalpha() returns False (in Python 2.x, haven't tried Python 
3.x)? Issue 1693050 said no. Perhaps someone with knowledge of Hindi 
could suggest how Python should handle it. I wouldn't want the re module 
to say one thing and the rest of the language to say another! :-)
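
As a stopgap until that is settled, here is a minimal sketch (Python 3
syntax, and my own assumption rather than anything the re module does)
that treats Unicode letters (L*), marks (M*) and numbers (N*), plus the
underscore, as word characters. The names is_word_char and is_word are
hypothetical helpers made up for illustration:

```python
import unicodedata

# Assumption for this sketch: a "word character" is any code point whose
# general category starts with L (letter), M (mark) or N (number), or "_".
WORD_CATEGORY_PREFIXES = ("L", "M", "N")


def is_word_char(ch):
    """Return True if ch counts as a word character under the assumption above."""
    return ch == "_" or unicodedata.category(ch)[0] in WORD_CATEGORY_PREFIXES


def is_word(text):
    """Return True if text is non-empty and made up only of word characters."""
    return bool(text) and all(is_word_char(ch) for ch in text)


if __name__ == "__main__":
    for lang in ("English", "中文", "हिन्दी", "two words"):
        print(lang, is_word(lang))
```

With this definition all three strings from the original langs tuple
count as single words, while a string containing a space does not.
Whether marks should count as word characters everywhere is exactly the
open question above.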