Unicode regex and Hindi language

Sat Nov 29 05:11:01 EST 2008

On Nov 29, 10:51 am, MRAB <goo... at mrabarnett.plus.com> wrote:
> John Machin wrote:
> > On Nov 29, 2:47 am, Shiao <multis... at gmail.com> wrote:
> >> The regex below identifies words in all languages I tested, but not in
> >> Hindi:
>
> >> pat = re.compile('^(\w+)$', re.U)
> >> ...
> >>    m = pat.search(l.decode('utf-8'))
> > [example snipped]
> >> From this I assumed that the Hindi text contains punctuation or other
> >> characters that prevent the word match.
>
> > This appears to be a bug in Python, as others have pointed out. Two
> > points not covered so far:
>
> Well, not so much a bug as a lack of knowledge.

It's a bug. See below.

> > (1) Instead of search() with pattern ^blahblah, use match() with
> > pattern blahblah -- unless it has been fixed fairly recently, search()
> > doesn't notice that the ^ means that it can give up when failure
> > occurs at the first try; it keeps on trying futilely at the 2nd,
> > 3rd, .... positions.
>
> > (2) "identifies words": \w+ (when fixed) matches a sequence of one or
> > more characters that could appear *anywhere* in a word in any language
> > (including computer languages). So it not only matches words, it also
> > matches non-words like '123' and '0x000' and '0123_' and 10 viramas --
> > in other words, you may need to filter out false positives. Also, in
> > some languages (e.g. Chinese) a "word" consists of one or more
> > characters and there is typically no spacing between "words"; \w+ will
> > identify whole clauses or sentences.
>
> This is down to the definition of "word character".

What is "This"? The two additional points I'm making have nothing to
do with \w.
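
A minimal sketch of both points, assuming Python 3 (where str patterns
are Unicode-aware without re.U). The sample strings and the helper name
is_single_token are only for illustration and are not from the original
post:

```python
import re

# Point (1): anchoring with ^ plus search() and using match() without ^
# accept the same strings; match() only ever tries position 0.
pat_search = re.compile(r'^(\w+)$')
pat_match = re.compile(r'(\w+)$')

def is_single_token(text):
    """True if text is one unbroken run of \\w characters."""
    return pat_match.match(text) is not None

samples = ['word', '123', '0x000', '0123_', 'two words']
for s in samples:
    # Both spellings agree on every sample.
    assert (pat_search.search(s) is not None) == is_single_token(s)
    print(repr(s), is_single_token(s))

# Point (2): \w+ accepts non-words, so a caller who wants real words
# still has to filter out false positives such as pure digit strings.
false_positives = [s for s in samples if is_single_token(s) and not s.isalpha()]
print(false_positives)
```

Here '123', '0x000' and '0123_' all pass the \w+ test even though none
of them is a word in any natural language.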

> Should \w match Mc
> characters? Should \w match a single character or a non-combining
> character with any combining characters, ie just Lo or Lo, Lo+Mc,
> Lo+Mc+Mc, etc?

Huh? I thought it was settled. Read Terry Reedy's latest message. Read
the bug report it points to (http://bugs.python.org/issue1693050),
especially the contribution from MvL. To paraphrase a remark by the
timbot, Martin reads Unicode tech reports so that we don't have to.
However, if you are a doubter or have insomnia, read http://unicode.org/reports/tr18/
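
To see why the Mc/Mn question matters for Hindi, here is a rough
sketch, assuming Python 3 and the standard unicodedata module. The
helpers is_word_char and is_word_like are hypothetical names, and the
category test (letters, marks, decimal digits, connector punctuation)
is only an approximation of the word-character definition discussed in
TR18, not a faithful implementation of it:

```python
import unicodedata

def is_word_char(ch):
    """Approximate 'word character': letter, mark, decimal digit or connector."""
    cat = unicodedata.category(ch)
    return cat[0] in ('L', 'M') or cat in ('Nd', 'Pc')

def is_word_like(text):
    """True if text is non-empty and every character is a word character."""
    return bool(text) and all(is_word_char(ch) for ch in text)

# The word "Hindi" in Devanagari, spelled with escapes to avoid encoding issues.
word = '\u0939\u093f\u0928\u094d\u0926\u0940'

# Letters are category Lo; the vowel signs are Mc and the virama is Mn.
for ch in word:
    print('U+%04X' % ord(ch), unicodedata.category(ch), unicodedata.name(ch))

print(is_word_like(word))
```

A definition of \w that admits only letters and digits stops at the
first vowel sign or virama, which is why the combining marks have to be
included for a Hindi word to match as a whole.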

Cheers,
John