Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Tue Jul 18 12:44:52 EDT 2017


On Wed, Jul 19, 2017 at 12:09 AM, Random832 <random832 at fastmail.com> wrote:
> On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote:
>> What do you mean about regular expressions? You can use REs with
>> normalized strings. And if you have any valid definition of "real
>> character", you can use it just as well on an NFC-normalized or
>> NFD-normalized string as on any other. They're just strings, you know.
>
> I don't understand how normalization is supposed to help with this. It's
> not like there aren't valid combinations that do not have a
> corresponding single NFC codepoint (to say nothing of the situation with
> e.g. Indic languages).
>
> In principle probably a viable solution for regex would be to add
> character classes for base and combining characters, and then
> "[[:base:]][[:combining:]]*" can be used as a building block if
> necessary.
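
Python's re module has no [[:base:]] or [[:combining:]] classes, so as
a rough sketch of that building block, the following groups each
character with the combining marks that follow it. It assumes that a
nonzero unicodedata.combining() value is a good enough test for
"combining", which is only an approximation and not full grapheme
cluster segmentation. The name split_base_plus_combining is made up
for illustration.

```python
import unicodedata

def split_base_plus_combining(text):
    """Group each character with the combining marks that follow it.

    Assumption: a character counts as "combining" when its canonical
    combining class is nonzero. Marks with class 0 (some Indic vowel
    signs, for example) are treated as base characters here.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Attach the mark to the cluster it modifies.
            clusters[-1] += ch
        else:
            # Start a new cluster with this character.
            clusters.append(ch)
    return clusters

# "q" + COMBINING ACUTE ACCENT has no precomposed codepoint, so NFC
# leaves it as two codepoints; the splitter still keeps it together.
sample = unicodedata.normalize("NFC", "q\u0301a\u0301b")
clusters = split_base_plus_combining(sample)
```

Here the "a" plus accent collapses to U+00E1 under NFC, while the "q"
plus accent stays as two codepoints in one cluster.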

Once you NFC or NFD normalize both strings, strings that represent the
same text will generally have identical sequences of codepoints. (There
are some exceptions, and for certain types of matching, you might want
to use NFKC/NFKD instead.) You should then be able to use normal regular
expressions to match correctly. I don't know of any situations where you
want to match "any base character" or "any combining character"; what
you're more likely to want is "match the letter á", without caring
whether it's represented as U+0061 U+0301 or as U+00E1. That's where
Unicode normalization comes in.
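
A minimal sketch of what that looks like with the standard library's
unicodedata and re modules. The helper names nfc and count_literal and
the sample text are assumptions for illustration only.

```python
import re
import unicodedata

def nfc(s):
    """Return s in Normalization Form C."""
    return unicodedata.normalize("NFC", s)

def count_literal(needle, haystack):
    """Count occurrences of needle in haystack, regardless of which
    canonically equivalent form either string happens to use."""
    return len(re.findall(re.escape(nfc(needle)), nfc(haystack)))

# "más árbol ágil": the first and third á are U+0061 U+0301,
# the second is the single codepoint U+00E1.
text = "ma\u0301s \u00e1rbol a\u0301gil"

# Without normalization, only the precomposed form matches.
raw_hits = len(re.findall("\u00e1", text))

# With both sides normalized, every á matches.
normalized_hits = count_literal("a\u0301", text)

# Compatibility characters such as the "fi" ligature U+FB01 are left
# alone by NFC and only fold under NFKC/NFKD.
ligature_nfc = nfc("\ufb01")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")
```

The important part is normalizing the pattern as well as the subject
string; normalizing only one side leaves you with the same mismatch.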

ChrisA



More information about the Python-list mailing list