Unicode regex and Hindi language

John Machin sjmachin at lexicon.net
Sat Nov 29 17:41:15 EST 2008


On Nov 30, 4:33 am, Terry Reedy <tjre... at udel.edu> wrote:
> Martin v. Löwis wrote:
> > To be fair to Python (and SRE),

I was being unfair? In the context, "bug" == "needs to be changed";
see below.

> > SRE predates TR#18 (IIRC) - at least
> > annex C was added somewhere between revision 6 and 9, i.e. in early
> > 2004. Python's current definition of \w is a straight-forward extension
> > of the historical \w definition (of Perl, I believe), which,
> > unfortunately, fails to recognize some of the Unicode subtleties.
>
> I agree about not dumping on the past.

Dumping on the past?? I used the term "bug" in the same sense as you
did: "I suggest that OP (original poster) Shiao file a bug report at
http://bugs.python.org".

>  When unicode support was added
> to re, it was a somewhat experimental advance over bytes-only re.  Now
> that Python has spread to south Asia as well as east Asia, it is time to
> advance it further.  I think this is especially important for 3.0, which
> will attract such users with the option of native identifiers.  Re
> should be able to recognize Python identifiers as words.  I care not
> whether the patch is called a fix or an update.
>
> I have no personal need for this at the moment but it just happens that
> I studied Sanskrit a bit some years ago and understand the script and
> could explain why at least some 'marks' are really 'letters'.  There are
> several other south Asian scripts descended from Devanagari, and
> included in Unicode, that use the same or similar vowel mark system.  So
> updating Python's idea of a Unicode word will help users of several
> languages and make it more of a world language.
>
> I presume that not viewing letter marks as part of words would affect
> Hebrew and Arabic also.
>
> I wonder if the current rule also affects European words with accents
> written as separate marks instead of as part of combined characters.
> For instance, if Martin's last name is written 'L' 'o' 'diaeresis mark'
> 'w' 'i' 's' (6 chars) instead of 'L' 'o with diaeresis' 'w' 'i' 's' (5
> chars), is it still recognized as a word?  (I don't know how to do the
> input to do the test.)

Like this:

| >>> w1 = u"L\N{LATIN SMALL LETTER O WITH DIAERESIS}wis"
| >>> w2 = u"Lo\N{COMBINING DIAERESIS}wis"
| >>> w1
| u'L\xf6wis'
| >>> w2
| u'Lo\u0308wis'
| >>> import unicodedata as ucd
| >>> ucd.category(u'\u0308')
| 'Mn'
| >>> u'\u0308'.isalpha()
| False
| >>> import re
| >>> regex = re.compile(ur'\w+', re.UNICODE)
| >>> regex.match(w1).group(0)
| u'L\xf6wis'
| >>> regex.match(w2).group(0)
| u'Lo'
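
For the Hindi case the OP asked about, one would expect the same
truncation, because Devanagari vowel signs and the virama are combining
marks (categories Mc and Mn) rather than letters. A minimal sketch of
such a session (not from the original post; the word below is "Hindi"
written in Devanagari):

| >>> import re
| >>> import unicodedata as ucd
| >>> hindi = u'\u0939\u093f\u0928\u094d\u0926\u0940'  # HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
| >>> ucd.category(u'\u093f')  # DEVANAGARI VOWEL SIGN I
| 'Mc'
| >>> u'\u093f'.isalpha()
| False
| >>> re.match(ur'\w+', hindi, re.UNICODE).group(0)
| u'\u0939'

So \w+ stops after the first consonant, just as it stops at the
combining diaeresis in w2 above.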



