Unicode regex and Hindi language

Fri Nov 28 18:51:35 EST 2008

John Machin wrote:
> On Nov 29, 2:47 am, Shiao <multis... at gmail.com> wrote:
>> The regex below identifies words in all languages I tested, but not in
>> Hindi:
> 
>> pat = re.compile(r'^(\w+)$', re.U)
>> ...
>>    m = pat.search(l.decode('utf-8'))
> [example snipped]
>> From this I assume that the Hindi text contains punctuation or other
>> characters that prevent the word match.
> 
> This appears to be a bug in Python, as others have pointed out. Two
> points not covered so far:
> 
Well, not so much a bug as a lack of knowledge.

> (1) Instead of search() with pattern ^blahblah, use match() with
> pattern blahblah -- unless it has been fixed fairly recently, search()
> doesn't notice that the ^ means that it can give up when failure
> occurs at the first try; it keeps on trying futilely at the 2nd,
> 3rd, .... positions.
> 
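Point (1) can be sketched as follows. This is only an illustration, assuming
Python 3 (where \w is Unicode-aware for str patterns without re.U); the helper
names word_via_search and word_via_match are made up for the comparison. Both
forms accept and reject the same lines; the difference the quoted text describes
is only in how much work is done on failure.

```python
import re

# Original style: anchor with ^ and call search().
anchored_search = re.compile(r'^(\w+)$')
# Suggested style: drop the ^ and call match(), which only tries position 0.
plain_match = re.compile(r'(\w+)$')

def word_via_search(line):
    """Return the single word on the line, or None (search + ^ form)."""
    m = anchored_search.search(line)
    return m.group(1) if m else None

def word_via_match(line):
    """Return the single word on the line, or None (match form)."""
    m = plain_match.match(line)
    return m.group(1) if m else None

for sample in ["word", "two words", "abc_123", ""]:
    # Both helpers give the same answer for every sample.
    print(repr(sample), word_via_search(sample), word_via_match(sample))
```
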
> (2) "identifies words": \w+ (when fixed) matches a sequence of one or
> more characters that could appear *anywhere* in a word in any language
> (including computer languages). So it not only matches words, it also
> matches non-words like '123' and '0x000' and '0123_' and 10 viramas --
> in other words, you may need to filter out false positives. Also, in
> some languages (e.g. Chinese) a "word" consists of one or more
> characters and there is typically no spacing between "words"; \w+ will
> identify whole clauses or sentences.
> 
This is down to the definition of "word character". Should \w match Mc
(spacing combining mark) characters? Should \w match only a single
character, or a non-combining character together with any combining
characters that follow it, i.e. just Lo, or Lo, Lo+Mc, Lo+Mc+Mc, etc.?
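
To make the question concrete, here is a minimal sketch, assuming Python 3 and
the standard unicodedata module. It lists the general category of each code
point in the Devanagari spelling of "Hindi" (the vowel signs are Mc and the
virama is Mn), then applies one possible answer to the question: a base
character followed by any number of marks. The function is_word is a
hypothetical helper, not what the re module does.

```python
import unicodedata

# "Hindi" in Devanagari, written with escapes so each code point is explicit.
HINDI = "\u0939\u093f\u0928\u094d\u0926\u0940"

def categories(text):
    """Return the Unicode general category of every code point in text."""
    return [unicodedata.category(ch) for ch in text]

def is_word(text):
    """Hypothetical rule: letters, numbers or '_' as base characters,
    each optionally followed by combining marks (Mn, Mc, Me)."""
    seen_base = False
    for ch in text:
        major = unicodedata.category(ch)[0]
        if major in "LN" or ch == "_":
            seen_base = True          # a base character
        elif major == "M" and seen_base:
            continue                  # a mark attached to an earlier base
        else:
            return False              # space, punctuation, or a leading mark
    return seen_base

print(categories(HINDI))
print(is_word(HINDI))
print(is_word("\u094d" * 10))  # ten viramas with no base character
```

Under this rule a run of marks with no base character is rejected, but strings
such as '123' still pass, so the filtering mentioned in (2) is still needed.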