Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Tue Jul 18 15:32:09 EDT 2017


On Wed, Jul 19, 2017 at 4:56 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>> What I *think* you're asking for is for square brackets in a regex to
>> count combining characters with their preceding base character.
>
> Yes. My example tries to match a single character against a single
> character.
>
>> That would make a lot of sense, and would actually be a reasonable
>> feature to request. (Probably as an option, in case there's a backward
>> compatibility issue.)
>
> There's the flag re.IGNORECASE. In the same vein, it might be useful to
> have re.IGNOREDIACRITICS, which would match
>
>    re.match("^[abc]$", "ä", re.IGNOREDIACRITICS)
>
> regardless of normalization.

That's a different feature, and can be achieved with a different normalization:

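import unicodedata

def fold(s):
    """Fold a string for 'search compatibility'.

    Returns a modified version of s with no diacriticals.
    """
    s = s.casefold()
    # Decompose so that accents become separate combining characters.
    s = unicodedata.normalize("NFKD", s)
    # Drop the combining marks in the range U+0300..U+033F.
    s = ''.join(c for c in s if c < '\u0300' or c > '\u033f')
    return unicodedata.normalize("NFKC", s)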

This is something that you might use when searching, as people will
expect to be able to type "cafe" to find "café". It is deliberately
lossy.
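
As a rough illustration of the "regardless of normalization" part, here
is a small sketch that repeats the fold() above and feeds it the same
word in composed and decomposed form. The sample strings and variable
names are mine, chosen only for the demonstration:

```python
import re
import unicodedata

def fold(s):
    """Same folding as above: casefold, decompose, strip marks, recompose."""
    s = s.casefold()
    s = unicodedata.normalize("NFKD", s)
    s = ''.join(c for c in s if c < '\u0300' or c > '\u033f')
    return unicodedata.normalize("NFKC", s)

composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The two spellings differ as strings but fold to the same search key.
print(composed == decomposed)
print(fold(composed) == fold(decomposed) == "cafe")

# Folding the subject first lets a plain character class match it.
print(bool(re.match("^[abc]$", fold("\u00e4"))))   # 'ä', composed
print(bool(re.match("^[abc]$", fold("a\u0308"))))  # 'ä', decomposed
```

Folding the subject string before matching gets close to what the
proposed flag would do, without needing any change to the re module.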

But having the re module group code points into logical characters
according to 'base + combining' is a different feature. It may be
worth adding. I don't think your re.IGNOREDIACRITICS is something that
belongs in the stdlib, as different search contexts require different
folding (Google, for instance, will find "ı" when you search for "i" -
but then, Google also finds "python" when you search for "phyton").
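
To make the 'base + combining' idea concrete, here is a minimal sketch
of the grouping done by hand, outside of re. The clusters() helper is
hypothetical (nothing like it exists in the stdlib), and it only
attaches characters with a nonzero canonical combining class to the
preceding character, so it is an approximation rather than a full
grapheme cluster segmentation:

```python
import unicodedata

def clusters(s):
    """Split s into 'base + combining marks' groups.

    Hypothetical helper, for illustration only. A character joins the
    previous group when unicodedata.combining() reports a nonzero
    canonical combining class for it.
    """
    groups = []
    for ch in s:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch      # attach the mark to its base
        else:
            groups.append(ch)     # start a new logical character
    return groups

decomposed = "cafe\u0301"         # 'e' + U+0301 COMBINING ACUTE ACCENT
print(len(decomposed))            # counts code points
print(len(clusters(decomposed)))  # counts base + combining groups

# Compare one logical character against a set, regardless of normalization.
wanted = {"\u00e4", "b", "c"}     # 'ä' stored in composed form
candidate = clusters("a\u0308")[0]
print(unicodedata.normalize("NFC", candidate) in wanted)
```

A regex option along these lines would do this grouping before applying
square brackets. The third-party regex module on PyPI already offers \X
for matching a whole grapheme cluster, which may be worth a look in the
meantime.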

ChrisA


