Unicode regex and Hindi language

Shiao multiseed at gmail.com
Fri Nov 28 10:47:23 EST 2008


The regex below identifies words in all languages I tested, but not in
Hindi:

# -*- coding: utf-8 -*-

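import re

# Raw string so the backslash in \w reaches the regex engine unchanged
pat = re.compile(r'^(\w+)$', re.U)
langs = ('English', '中文', 'हिन्दी')

for l in langs:
    # Each literal is a UTF-8 byte string; decode it to unicode before matching
    m = pat.search(l.decode('utf-8'))
    print l, m and m.group(1)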

Output:

English English
中文 中文
हिन्दी None

From this I assumed that the Hindi text contains punctuation or other
characters that prevent the word match. Even more puzzling is this:

pat = re.compile(r'^(\W+)$', re.U)  # note: now \W

for l in langs:
    m = pat.search(l.decode('utf-8'))
    print l, m and m.group(1)

Output:

English None
中文 None
हिन्दी None

How can the Hindi text be both not a word and "not not a word"?
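
One way to test the assumption above is to look at the Hindi string one
character at a time. The sketch below lists each code point with its
Unicode general category and whether it matches \w on its own. It uses
only the standard re and unicodedata modules; the names WORD and
describe are hypothetical helpers, not part of the code above, and the
string is written with escapes so the file encoding cannot interfere.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# Hypothetical helper pattern: a single "word" character, Unicode-aware
WORD = re.compile(r'\w', re.U)

def describe(text):
    """Return (code point, general category, is-word-char) per character."""
    return [('U+%04X' % ord(ch),
             unicodedata.category(ch),
             WORD.match(ch) is not None)
            for ch in text]

# The same six code points as the Hindi literal in langs
hindi = u'\u0939\u093f\u0928\u094d\u0926\u0940'

for row in describe(hindi):
    print('%s %s %s' % row)
```

If the string turns out to mix letters (category Lo) with combining
marks (categories Mc and Mn), that would be consistent with both
results: ^(\w+)$ fails at the first mark, and ^(\W+)$ fails at the
first letter, so neither pattern can cover the whole string.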

Any clue would be much appreciated!

Best.



