Grapheme clusters, a.k.a. real characters

Steven D'Aprano steve at pearwood.info
Wed Jul 19 02:49:13 EDT 2017


On Wed, 19 Jul 2017 17:51:49 +1200, Gregory Ewing wrote:

> Chris Angelico wrote:
>> Once you NFC or NFD normalize both strings, identical strings will
>> generally have identical codepoints... You should then be able to use
>> normal regular expressions to match correctly.
> 
> Except that if you want to match a set of characters,
> you can't reliably use [...], you would have to write them out as
> alternatives in case some of them take up more than one code point.

Good point!

A quibble -- there's no "in case" here, since you, the programmer, will 
always know whether they have a single code point form or not. If you're 
unsure, look it up, or call unicodedata.normalize().

(Yeah, right, like the average coder will remember to do this...)
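
For anyone who does want to look it up programmatically, here is a minimal
sketch. The helper name has_single_code_point_form is made up for this
example; only unicodedata.normalize() is a real standard library call.

```python
import unicodedata

def has_single_code_point_form(char):
    """Return True if NFC-normalizing `char` yields exactly one code point.

    `char` is expected to be one user-perceived character, possibly
    written as a base letter followed by combining marks.
    (Hypothetical helper, not part of the standard library.)
    """
    return len(unicodedata.normalize("NFC", char)) == 1

# a + COMBINING TILDE composes to U+00E3, so it has a single code point form.
print(has_single_code_point_form("a\N{COMBINING TILDE}"))  # True

# Unicode has no precomposed "q with tilde", so this stays two code points.
print(has_single_code_point_form("q\N{COMBINING TILDE}"))  # False
```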

Nevertheless, although it might be annoying and tricky, regexes *are* 
flexible enough to deal with this problem. After all, you can't use [th] 
to match "th" as a unit either, and regex character set notation 
[abcd] is logically equivalent to (a|b|c|d).
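
To make Gregory's point concrete, here is a rough sketch using the standard
re module. The sample text and the two patterns are my own illustration. The
character class treats the base letter and the combining mark as two separate
members, while the alternation keeps them together as a unit.

```python
import re
import unicodedata

# Decomposed text: 'x', 'a', U+0303 COMBINING TILDE (three code points).
text = unicodedata.normalize("NFD", "x\u00e3")

# Inside [...], the decomposed a-tilde becomes two independent members.
char_class = re.compile("[xa\N{COMBINING TILDE}]")

# Written as alternatives, the two code points are matched as one unit.
alternation = re.compile("(?:x|a\N{COMBINING TILDE})")

print(char_class.findall(text))   # three matches: 'x', 'a', and the lone tilde
print(alternation.findall(text))  # two matches: 'x' and 'a' + tilde together
```

If a bare "a" were also among the alternatives, it would have to come after
the a + tilde alternative, since the regex engine tries alternatives from
left to right.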

I wonder how Perl 6 has solved this problem? They seem to be much more 
advanced when it comes to dealing with Unicode.

The *really* tricky part is if you receive a string from the user 
intended as a regular expression. If they provide

[xyzã]

as part of a regex, and you receive ã in decomposed form

U+0061 LATIN SMALL LETTER A + U+0303 COMBINING TILDE

you can't be sure that they actually intended:

U+00E3 LATIN SMALL LETTER A WITH TILDE

maybe they're smarter than you think and they actually do mean 

[xyza\N{COMBINING TILDE}] = (x|y|z|a|\N{COMBINING TILDE})
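
Here is a small sketch of why blindly normalizing the user's pattern is
risky. The variable names are invented for the example. NFC-normalizing the
pattern silently turns the second reading into the first, so the pattern
stops matching a bare "a" or a lone combining tilde.

```python
import re
import unicodedata

# What the user typed: a, then COMBINING TILDE, inside a character class.
user_pattern = "[xyza\N{COMBINING TILDE}]"

# NFC composes a + tilde into U+00E3, so the class becomes [xyzã].
normalized_pattern = unicodedata.normalize("NFC", user_pattern)

lone_tilde = "\N{COMBINING TILDE}"

# As typed, the class matches a bare 'a' and a lone combining tilde.
print(bool(re.search(user_pattern, "a")))               # True
print(bool(re.search(user_pattern, lone_tilde)))        # True

# After normalization it matches neither, only the precomposed character.
print(bool(re.search(normalized_pattern, "a")))         # False
print(bool(re.search(normalized_pattern, lone_tilde)))  # False
print(bool(re.search(normalized_pattern, "\u00e3")))    # True
```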


-- 
Steve


