Unicode regex and Hindi language

Terry Reedy tjreedy at udel.edu
Sat Nov 29 12:33:39 EST 2008


Martin v. Löwis wrote:

> To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least
> annex C was added somewhere between revision 6 and 9, i.e. in early
> 2004. Python's current definition of \w is a straightforward extension
> of the historical \w definition (of Perl, I believe), which,
> unfortunately, fails to recognize some of the Unicode subtleties.

I agree about not dumping on the past.  When unicode support was added 
to re, it was a somewhat experimental advance over bytes-only re.  Now 
that Python has spread to south Asia as well as east Asia, it is time to 
advance it further.  I think this is especially important for 3.0, which 
will attract such users with the option of native identifiers.  Re 
should be able to recognize Python identifiers as words.  I care not 
whether the patch is called a fix or an update.

I have no personal need for this at the moment but it just happens that 
I studied Sanskrit a bit some years ago and understand the script and 
could explain why at least some 'marks' are really 'letters'.  There are 
several other south Asian scripts related to Devanagari, and 
included in Unicode, that use the same or similar vowel mark system.  So 
updating Python's idea of a Unicode word will help users of several 
languages and make it more of a world language.
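
A minimal sketch of the point about marks, assuming Python 3 and the 
standard unicodedata module; the sample word and the helper name 
describe are illustrative choices, not anything from the re module.  It 
lists the Unicode general category of each code point in the word 
'Hindi' written in Devanagari, where the vowel signs and the virama are 
classified as marks (M*) rather than letters (L*):

```python
import unicodedata

# Sample word (an assumption for illustration): "Hindi" in Devanagari.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

def describe(text):
    """Return (code point, general category, name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]

for row in describe(hindi):
    print(*row)
```

A \w definition that accepts only letter and number categories would 
stop at each of the mark code points in this word.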

I presume that not viewing letter marks as part of words would affect 
Hebrew and Arabic also.

I wonder if the current rule also affects European words with accents 
written as separate marks instead of as part of combined characters. 
For instance, if Martin's last name is written 'L' 'o' 'diaeresis mark' 
'w' 'i' 's' (6 chars) instead of 'L' 'o with diaeresis' 'w' 'i' 's' (5 
chars), is it still recognized as a word?  (I don't know how to do the 
input to do the test.)
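
One way the test might be run, sketched under the assumption of Python 3, 
where str patterns are Unicode-aware by default (in 2.x the re.UNICODE 
flag would be needed).  The escape sequences avoid any keyboard input 
problem; the helper name word_matches is hypothetical:

```python
import re
import unicodedata

composed = "L\u00f6wis"     # 'o with diaeresis' as one code point (5 chars)
decomposed = "Lo\u0308wis"  # 'o' followed by COMBINING DIAERESIS (6 chars)

def word_matches(text):
    """Return every maximal run of \\w characters found in text."""
    return re.findall(r"\w+", text)

for label, s in (("composed", composed), ("decomposed", decomposed)):
    categories = [unicodedata.category(ch) for ch in s]
    print(label, len(s), categories, word_matches(s))
```

If the decomposed form comes back as more than one piece, the combining 
mark is not being treated as part of the word by that version of re.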

I notice from the manual "All identifiers are converted into the normal 
form NFC while parsing; comparison of identifiers is based on NFC."  If 
NFC uses accented letters, then the issue is finessed away for European 
words simply because Unicode includes combined characters for European 
scripts but not for south Asian scripts.
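
A small sketch of that difference, assuming Python 3 and 
unicodedata.normalize; the sample strings and the helper name nfc are 
illustrative.  NFC folds 'o' plus a combining diaeresis into a single 
precomposed letter, while the Devanagari word has no precomposed forms 
to fold into and keeps its separate marks:

```python
import unicodedata

def nfc(text):
    """Return text normalized to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

decomposed = "Lo\u0308wis"                        # 6 code points
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"    # 6 code points

for label, s in (("Latin", decomposed), ("Devanagari", hindi)):
    normalized = nfc(s)
    print(label, len(s), "->", len(normalized), normalized == s)
```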

Terry Jan Reedy



