Unicode regex and Hindi language

Fri Nov 28 11:29:21 EST 2008

Shiao wrote:

> The regex below identifies words in all languages I tested, but not in
> Hindi:
> 
> # -*- coding: utf-8 -*-
> 
> import re
> pat = re.compile('^(\w+)$', re.U)
> langs = ('English', '中文', 'हिन्दी')
> 
> for l in langs:
>     m = pat.search(l.decode('utf-8'))
>     print l, m and m.group(1)
> 
> Output:
> 
> English English
> 中文 中文
> हिन्दी None
> 
> From this I assumed that the Hindi text contains punctuation or other
> characters that prevent the word match. Now, even more puzzling is
> this:
> 
> pat = re.compile('^(\W+)$', re.U) # note: now \W
> 
> for l in langs:
>     m = pat.search(l.decode('utf-8'))
>     print l, m and m.group(1)
> 
> Output:
> 
> English None
> 中文 None
> हिन्दी None
> 
> How can the Hindi be both not a word and "not not a word"??
> 
> Any clue would be much appreciated!

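It's not a word as far as \w is concerned, but that doesn't mean that it
consists entirely of non-alpha characters either. ^\w+$ needs every
character to be alphanumeric and ^\W+$ needs every character to be
non-alphanumeric; the Hindi string is a mix of both, so neither pattern
matches. Here's what Python gets to see: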

>>> langs[2]
u'\u0939\u093f\u0928\u094d\u0926\u0940'
>>> from unicodedata import name
>>> for c in langs[2]:
...     print repr(c), name(c), ["non-alpha", "ALPHA"][c.isalpha()]
...
u'\u0939' DEVANAGARI LETTER HA ALPHA
u'\u093f' DEVANAGARI VOWEL SIGN I non-alpha
u'\u0928' DEVANAGARI LETTER NA ALPHA
u'\u094d' DEVANAGARI SIGN VIRAMA non-alpha
u'\u0926' DEVANAGARI LETTER DA ALPHA
u'\u0940' DEVANAGARI VOWEL SIGN II non-alpha
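
The non-alpha characters above are combining marks (Unicode general
categories Mc and Mn), not punctuation. If the goal is to treat such a
string as one word, a rough sketch is to classify characters by category
yourself and accept marks as well. This is only my assumption about what
you want, not something the re module provides, and the helper name
is_word is made up for illustration. It should run on Python 2.6 or later
and on Python 3:

```python
from __future__ import print_function
import unicodedata

def is_word(text):
    """Hypothetical helper: True if text is non-empty and every character
    is a letter (L*), a combining mark (M*), a number (N*) or '_'."""
    if not text:
        return False
    for ch in text:
        # category() returns a two-letter code such as 'Lo' or 'Mc';
        # the first letter is the major class.
        if ch != u'_' and unicodedata.category(ch)[0] not in u'LMN':
            return False
    return True

hindi = u'\u0939\u093f\u0928\u094d\u0926\u0940'

# The six characters have categories Lo, Mc, Lo, Mn, Lo, Mc.
print([unicodedata.category(ch) for ch in hindi])

print(is_word(hindi))            # True: letters plus combining marks
print(is_word(u'English'))       # True
print(is_word(u'\u4e2d\u6587'))  # True
print(is_word(u'a b'))           # False: the space is category Zs
```

Whether marks should count as word characters depends on the
application; note that a string made only of combining marks would also
pass this check.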

Peter