Unicode regex and Hindi language

John Machin sjmachin at lexicon.net
Fri Nov 28 18:05:54 EST 2008


On Nov 29, 2:47 am, Shiao <multis... at gmail.com> wrote:
> The regex below identifies words in all languages I tested, but not in
> Hindi:

> pat = re.compile('^(\w+)$', re.U)
> ...
>    m = pat.search(l.decode('utf-8'))
[example snipped]
>
> From this I assume that the Hindi text contains punctuation or other
> characters that prevent the word match.

This appears to be a bug in Python, as others have pointed out. Two
points not covered so far:

(1) Instead of search() with pattern ^blahblah, use match() with
pattern blahblah. Unless it has been fixed fairly recently, search()
doesn't notice that the ^ means it can give up when the first attempt
fails; it keeps on trying futilely at the 2nd, 3rd, ... positions.
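
A minimal sketch of point (1), assuming Python 3 (where str patterns are
Unicode-aware by default, so re.U is not needed). The names word_pat and
whole_line_token are made up for illustration and are not from the
original post:

```python
import re

# match() only tries the pattern at position 0, so the leading ^ is dropped.
# Assumption: Python 3, where \w is Unicode-aware for str patterns.
word_pat = re.compile(r'(\w+)$')

def whole_line_token(line):
    """Return the token if the whole line is one run of word characters, else None."""
    m = word_pat.match(line)
    return m.group(1) if m else None

print(whole_line_token('hello'))        # a single token
print(whole_line_token('hello world'))  # contains a space, so no match
```

Whether search() still retries at later positions depends on the Python
version, so treat this as a habit worth having rather than a measured
speedup.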

(2) "identifies words": \w+ (when fixed) matches a sequence of one or
more characters that could appear *anywhere* in a word in any language
(including computer languages). So it not only matches words, it also
matches non-words like '123' and '0x000' and '0123_' and 10 viramas --
in other words, you may need to filter out false positives. Also, in
some languages (e.g. Chinese) a "word" consists of one or more
characters and there is typically no spacing between "words"; \w+ will
identify whole clauses or sentences.
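
To illustrate point (2), here is a rough sketch, again assuming Python 3.
The filter looks_like_word is a hypothetical helper, and its rule (at
least one Unicode letter, no digits or underscores) is only one possible
choice; it would wrongly reject real words that happen to contain digits:

```python
import re
import unicodedata

token = re.compile(r'\w+$')

def looks_like_word(text):
    """Hypothetical filter: needs a letter, rejects digits and underscores."""
    has_letter = any(unicodedata.category(ch).startswith('L') for ch in text)
    has_digit_or_underscore = any(ch.isdigit() or ch == '_' for ch in text)
    return has_letter and not has_digit_or_underscore

candidates = ['word', '123', '0x000', '0123_']

# Every candidate is a run of \w characters, so the regex accepts all of them.
matched = [c for c in candidates if token.match(c)]

# The extra filter removes the non-words.
kept = [c for c in matched if looks_like_word(c)]

# Chinese text has no spaces between words, so \w+ returns the whole run.
chinese_runs = re.findall(r'\w+', '今天天气很好')

print(matched)
print(kept)
print(chinese_runs)
```

Splitting such a run into actual words needs a segmenter or dictionary,
which is outside what \w+ can do.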

Cheers,
John


