Grapheme clusters, a.k.a.real characters

Chris Angelico rosuav at gmail.com
Tue Jul 18 13:21:51 EDT 2017


On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> what you're more likely to want is "match the letter á", and you don't
>> care whether it's represented as U+0061 U+0301 or as U+00E1. That's
>> where Unicode normalization comes in.
>
> Yes. Also, not every letter can be normalized to a single codepoint so
> NFC is not a way out. For example,
>
>     re.match("^[q̈]$", "q̈")
>
> returns None regardless of normalization.

In what language or context would you actually want to do this?

ChrisA



More information about the Python-list mailing list