Unicode regex and Hindi language

MRAB <google at mrabarnett.plus.com>
Sat Nov 29 13:20:50 EST 2008


Terry Reedy wrote:
> Martin v. Löwis wrote:
> 
>> To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least
>> Annex C was added somewhere between revision 6 and 9, i.e. in early
>> 2004. Python's current definition of \w is a straightforward extension
>> of the historical \w definition (of Perl, I believe), which,
>> unfortunately, fails to recognize some of the Unicode subtleties.
> 
> I agree about not dumping on the past.  When Unicode support was added 
> to re, it was a somewhat experimental advance over bytes-only re.  Now 
> that Python has spread to South Asia as well as East Asia, it is time to 
> advance it further.  I think this is especially important for 3.0, which 
> will attract such users with the option of native identifiers.  The re 
> module should be able to recognize Python identifiers as words.  I care 
> not whether the patch is called a fix or an update.
> 
> I have no personal need for this at the moment but it just happens that 
> I studied Sanskrit a bit some years ago and understand the script and 
> could explain why at least some 'marks' are really 'letters'.  There are 
> several other South Asian scripts related to Devanagari, and included 
> in Unicode, that use the same or a similar vowel-mark system.  So 
> updating Python's idea of a Unicode word will help users of several 
> languages and make it more of a world language.
> 
> I presume that not viewing letter marks as part of words would affect 
> Hebrew and Arabic also.
> 
> I wonder if the current rule also affects European words with accents 
> written as separate combining marks instead of as precomposed 
> characters.  For instance, if Martin's last name is written 'L' 'o' 
> 'combining diaeresis' 'w' 'i' 's' (6 chars) instead of 'L' 'o with 
> diaeresis' 'w' 'i' 's' (5 chars), is it still recognized as a word?  
> (I don't know how to do the input to do the test.)
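
That test is easy to run from source by writing the marks as escapes
rather than typing them; this only reports what a given build does, not
what it should do:

    import re

    composed   = 'L\u00f6wis'    # 'o with diaeresis' as one codepoint (5 chars)
    decomposed = 'Lo\u0308wis'   # 'o' + U+0308 COMBINING DIAERESIS (6 chars)

    for name in (composed, decomposed):
        m = re.match(r'\w+$', name, re.UNICODE)
        print(len(name), 'chars:', 'word' if m else 'not a word')
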
> 
> I notice from the manual "All identifiers are converted into the normal 
> form NFC while parsing; comparison of identifiers is based on NFC."  If 
> NFC uses precomposed accented letters, then the issue is finessed away 
> for European words simply because Unicode includes precomposed 
> characters for European scripts but not for South Asian scripts.
> 
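For reference, that NFC normalization is a single call in the standard
library; checking it on the name above (again with escapes standing in
for raw input):

    import unicodedata

    decomposed = 'Lo\u0308wis'   # 'o' + U+0308 COMBINING DIAERESIS: 6 codepoints
    composed = unicodedata.normalize('NFC', decomposed)

    print(len(decomposed), len(composed))   # 6 5
    print(composed == 'L\u00f6wis')         # True: o + U+0308 composes to U+00F6
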
Does that mean that the re module will need to convert both the pattern 
and the text to be searched into NFC form first? And I'm still not clear 
whether \w, when used on a string consisting of Lo (Letter, other) 
followed by Mc (Mark, spacing combining), should match Lo and then Mc 
(one codepoint at a time) or both together (one character at a time, 
where a character consists of a base codepoint possibly followed by 
combining-mark codepoints).
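
As a concrete case, a Devanagari syllable can be built from escapes and
probed directly; the output only shows what the build at hand does:

    import re
    import unicodedata

    # U+0928 DEVANAGARI LETTER NA (Lo) + U+093F DEVANAGARI VOWEL SIGN I (Mc)
    syllable = '\u0928\u093f'

    for ch in syllable:
        print('U+%04X %s' % (ord(ch), unicodedata.category(ch)))

    print(re.findall(r'\w', syllable, re.UNICODE))   # which codepoints count as word chars
    print(re.findall(r'\w+', syllable, re.UNICODE))  # how far a "word" extends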

I ask because I'm working on the re module at the moment.


