Unicode regex and Hindi language

Fri Nov 28 11:36:17 EST 2008

On Fri, Nov 28, 2008 at 10:47 AM, Shiao <multiseed at gmail.com> wrote:
> The regex below identifies words in all languages I tested, but not in
> Hindi:
>
> # -*- coding: utf-8 -*-
>
> import re
> pat = re.compile('^(\w+)$', re.U)
> langs = ('English', '中文', 'हिन्दी')

I think the problem is that the Hindi text contains both alphanumeric
and non-alphanumeric characters.  I'm not very familiar with Hindi,
much less how it's represented in Unicode, but take a look at the
output of this code:

# -*- coding: utf-8 -*-
import unicodedata as ucd

langs = (u'English', u'中文', u'हिन्दी')
for lang in langs:
    print(lang)
    # Show each character with its Unicode name and general category.
    for char in lang:
        print("\t %s %s (%s)" % (char, ucd.name(char), ucd.category(char)))

Output:

English
	 E LATIN CAPITAL LETTER E (Lu)
	 n LATIN SMALL LETTER N (Ll)
	 g LATIN SMALL LETTER G (Ll)
	 l LATIN SMALL LETTER L (Ll)
	 i LATIN SMALL LETTER I (Ll)
	 s LATIN SMALL LETTER S (Ll)
	 h LATIN SMALL LETTER H (Ll)
中文
	 中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
	 文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
हिन्दी
	 ह DEVANAGARI LETTER HA (Lo)
	 ि DEVANAGARI VOWEL SIGN I (Mc)
	 न DEVANAGARI LETTER NA (Lo)
	 ् DEVANAGARI SIGN VIRAMA (Mn)
	 द DEVANAGARI LETTER DA (Lo)
	 ी DEVANAGARI VOWEL SIGN II (Mc)

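From that, we see that some characters in the Hindi string aren't
letters (Unicode general category L) but marks (category M): the two
vowel signs are Mc and the virama is Mn.  As far as I can tell, \w
with re.U matches only characters that Unicode classifies as
alphanumeric, plus the underscore, so those marks stop '^(\w+)$' from
matching the whole string.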
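
One possible workaround, as a rough sketch only: instead of relying on
\w, check each character's general category and accept letters (L*),
marks (M*), and numbers (N*).  The helper name is_word and the English
labels in samples are my own, not anything the re module provides, and
unlike \w this check does not accept the underscore.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata as ucd

# The pattern from the original post, written as a raw string.
pat = re.compile(r'^(\w+)$', re.U)

def is_word(text):
    """Hypothetical helper: True if text is non-empty and every character
    is a letter (L*), a mark (M*), or a number (N*)."""
    return bool(text) and all(ucd.category(char)[0] in 'LMN' for char in text)

samples = (('English', u'English'), ('Chinese', u'中文'), ('Hindi', u'हिन्दी'))
for label, text in samples:
    # Compare the regex result with the category-based check.
    print("%-8s regex=%-5s is_word=%s" % (label, bool(pat.match(text)), is_word(text)))
```

Accepting every mark category is deliberately loose; if that lets
through strings you don't consider words, the set of accepted
categories is the place to tighten it.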