Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Tue Jul 18 14:56:06 EDT 2017


Chris Angelico <rosuav at gmail.com>:

> On Wed, Jul 19, 2017 at 4:31 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Chris Angelico <rosuav at gmail.com>:
>>
>>> On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>>>> Yes. Also, not every letter can be normalized to a single codepoint so
>>>> NFC is not a way out. For example,
>>>>
>>>>     re.match("^[q̈]$", "q̈")
>>>>
>>>> returns None regardless of normalization.
> [...]
>
> What I *think* you're asking for is for square brackets in a regex to
> count combining characters with their preceding base character.

Yes. My example tries to match a single character against a single
character.
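
To make the failure concrete, here is a minimal sketch using only the
standard library. The variable names are mine, chosen for illustration.
It shows that "q̈" stays two code points under NFC, so a character
class, which consumes exactly one code point, cannot match it, whereas
"ä" composes to a single code point and matches:

```python
import re
import unicodedata

# "q" followed by U+0308 COMBINING DIAERESIS: one grapheme, two code points.
q_umlaut = "q\u0308"

# Unicode has no precomposed code point for this pair, so NFC leaves it alone.
nfc = unicodedata.normalize("NFC", q_umlaut)
print(len(nfc))  # 2

# Inside [...] the two code points are separate alternatives, so the class
# consumes exactly one code point and the anchored match fails.
print(re.match("^[" + nfc + "]$", nfc))  # None

# Contrast: "a" + U+0308 does compose under NFC, so the same pattern works.
a_umlaut = unicodedata.normalize("NFC", "a\u0308")
print(len(a_umlaut))  # 1
print(re.match("^[" + a_umlaut + "]$", a_umlaut) is not None)  # True
```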

> That would make a lot of sense, and would actually be a reasonable
> feature to request. (Probably as an option, in case there's a backward
> compatibility issue.)

There's the flag re.IGNORECASE. In the same vein, it might be useful to
have a re.IGNOREDIACRITICS flag, so that

   re.match("^[abc]$", "ä", re.IGNOREDIACRITICS)

would match regardless of normalization.
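
No such flag exists in the re module. As a rough approximation of the
idea, one could strip the combining marks from the subject string
before matching. The helper name strip_diacritics below is hypothetical,
and the sketch assumes that "diacritic" means a code point with a
nonzero canonical combining class after NFD decomposition:

```python
import re
import unicodedata

def strip_diacritics(text):
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a nonzero
    # canonical combining class for marks such as U+0308.
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# The precomposed and the decomposed spelling of "ä" both reduce to "a".
for subject in ("\u00e4", "a\u0308"):
    print(re.match("^[abc]$", strip_diacritics(subject)) is not None)  # True
```

This only preprocesses the subject, not the pattern, and letters with
no canonical decomposition (for example "ø") pass through unchanged, so
it is not a full substitute for the proposed flag.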


Marko


