Unicode regex and Hindi language

Terry Reedy tjreedy at udel.edu
Sat Nov 29 17:43:31 EST 2008


MRAB wrote:
> Terry Reedy wrote:

>> I notice from the manual "All identifiers are converted into the 
>> normal form NFC while parsing; comparison of identifiers is based on 
>> NFC."  If NFC used accented letters, then the issue is finessed away 
>> for European words simply because Unicode includes combined 
>> characters for European scripts but not for south Asian scripts.
>>
> Does that mean that the re module will need to convert both the pattern 
> and the text to be searched into NFC form first?

The quote says that Python 3 internally converts all identifiers in 
source code to NFC before compiling the code, so it can properly compare 
them.  If this were purely an internal matter, it would not need to be 
said. I interpret the quote as a warning that a programmer who wants to 
compare a 3.0 string to an identifier represented as a string is 
responsible for making sure that *his* string is also in NFC.  For instance:

ident = 3
...
if 'ident' in globals(): ...

The second ident must be NFC even if the programmer prefers and 
habitually writes another form because, like it or not, the first one 
will be turned into NFC before insertion into the code object and later 
into globals().
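
To make the point concrete, here is a minimal sketch using an accented 
Latin identifier, since the plain ASCII name above is the same in every 
normalization form.  The names (namespace, decomposed, composed) are 
made up for illustration.  Note that later versions of the language 
reference say NFKC rather than NFC; for this particular name the two 
forms give the same result.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed spelling).
decomposed = "e\u0301"
# The same letter as the single precomposed codepoint U+00E9.
composed = "\u00e9"

# Compile a tiny piece of source whose identifier is spelled in the
# decomposed form; the parser normalizes the name before storing it.
namespace = {}
exec(decomposed + " = 3", namespace)

# Looking the name up with the spelling the programmer typed fails ...
found_raw = decomposed in namespace
# ... while the normalized spelling is found.
found_nfc = unicodedata.normalize("NFC", decomposed) in namespace

print(found_raw, found_nfc)
```

The lookup only succeeds once the string has been normalized the same 
way the parser normalized the identifier.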

So my thought is that re should take the strings as given, but that the 
re doc should warn about logically equal forms not matching.  (Perhaps 
it does already; I have not read it in years.)  If a text uses a 
different normalization form, which some surely will, the programmer is 
responsible for using the same in the re pattern.
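
A small sketch of what "logically equal forms not matching" looks like 
in practice, assuming re compares codepoints exactly as given.  The 
sample word is Latin rather than Hindi only because its composed and 
decomposed spellings are easy to see; the variable names are 
illustrative.

```python
import re
import unicodedata

text = "caf\u00e9"       # 'é' as one precomposed codepoint (NFC)
pattern = "cafe\u0301"   # 'e' + combining acute accent (NFD)

# Taken as given, the two spellings differ codepoint by codepoint,
# so the search finds nothing.
raw_match = re.search(pattern, text)

# Putting both pattern and text into the same form makes them agree.
norm_pattern = unicodedata.normalize("NFC", pattern)
norm_text = unicodedata.normalize("NFC", text)
norm_match = re.search(norm_pattern, norm_text)

print(raw_match is None, norm_match is not None)
```

Either form works as long as the pattern and the text use the same one.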

> And I'm still not clear 
> whether \w, when used on a string consisting of Lo followed by Mc, 
> should match Lo and then Mc (one codepoint at a time) or together (one 
> character at a time, where a character consists of some base character 
> codepoint possibly followed by modifier codepoints).

Programs that transform text to glyphs may have to read bundles of 
codepoints before starting to output, but my guess is that re should do 
the simplest thing and match codepoint by codepoint, assuming that is 
what it currently does.  I gather that would just mean expanding the 
current definition of word char.  But I would look at TR18 and see what 
Martin says.
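
As a rough sketch of what "expanding the definition of word char" could 
mean, here is a codepoint-by-codepoint scan written without re.  The 
rule used (letters, marks, decimal digits and underscore) is my 
assumption for illustration; it is not the re module's definition and 
only approximates what TR18 recommends.  The helper names 
is_word_codepoint and find_words are hypothetical.

```python
import unicodedata

def is_word_codepoint(ch):
    """Assumed rule: letters (L*), marks (M*), decimal digits (Nd), '_'."""
    cat = unicodedata.category(ch)
    return ch == "_" or cat[0] in ("L", "M") or cat == "Nd"

def find_words(text):
    """Collect maximal runs of word codepoints, one codepoint at a time."""
    words, current = [], []
    for ch in text:
        if is_word_codepoint(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# U+0915 DEVANAGARI LETTER KA is Lo; U+093E DEVANAGARI VOWEL SIGN AA is Mc.
sample = "\u0915\u093e \u0939\u0948"
categories = [unicodedata.category(c) for c in "\u0915\u093e"]
words = find_words(sample)
print(categories, len(words))
```

Each codepoint is tested on its own, yet a letter and its following 
vowel sign stay in the same word because both count as word codepoints.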

> I ask because I'm working on the re module at the moment.

Great.  I *think* that the change should be fairly simple.

Terry Jan Reedy



