Grapheme clusters, a.k.a. real characters

Wed Jul 19 11:45:24 EDT 2017

Chris Angelico <rosuav at gmail.com>:

> Now, this is a performance question, and it's not unreasonable to talk
> about semantics first and let performance wait for later. But when you
> consider how many ASCII-only strings Python uses internally (the names
> of basically every global function and every attribute in every stdlib
> module), and how you'll be enlarging those by a factor of 16 *and*
> making every character lookup require two pointer reads, it's pretty
> much a non-starter.

It's not that difficult or costly.

> Also, this system has the nasty implication that the creation of a new
> combining character will fundamentally change the way a string
> behaves.

If you go with a new Text class, you don't face any
backward-compatibility issues.

If you go with expanding str, you can run into some minor issues.

> But if combining characters behave fundamentally differently to
> others, there would be a change in string representation when U+1DF6
> became a combining character. That's going to cause MASSIVE upheaval.
> I don't think there's any solution to that, but if you can find one,
> do please elaborate.

So let's assume we will expand str to accommodate the requirements of
grapheme clusters.

All existing code would still produce only traditional strings. The only
way to introduce the new "super code points" is by invoking the
str.canonical() method:

    text = "hyvää yötä".canonical()

In this case text would still be a fully traditional string because both
ä and ö are represented by a single code point in NFC. However:

    >>> q = unicodedata.normalize("NFC", "aq̈u")
    >>> len(q)
    4
    >>> text = q.canonical()
    >>> len(text)
    3
    >>> text[0]
    'a'
    >>> text[1]
    'q̈'
    >>> text[2]
    'u'
    >>> q2 = unicodedata.normalize("NFC", text)
    >>> len(q2)
    4
    >>> text.encode()
    b'aq\xcc\x88u'
    >>> q.encode()
    b'aq\xcc\x88u'
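
For comparison, the indexing behaviour in that session can be approximated
in today's Python with a plain list of clusters. This is only a rough
sketch: clusters() is a made-up helper name, not a proposed API, and it
simply attaches every code point with a nonzero unicodedata.combining()
class to the code point before it. Real grapheme cluster segmentation
(Unicode UAX #29) has more rules than that.

```python
import unicodedata

def clusters(s):
    """Hypothetical helper: split s into base characters, each followed
    by its combining marks. A simplification, not full UAX #29."""
    out = []
    for ch in unicodedata.normalize("NFC", s):
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the previous cluster
        else:
            out.append(ch)     # start a new cluster
    return out

q = unicodedata.normalize("NFC", "aq\u0308u")   # q + COMBINING DIAERESIS
text = clusters(q)
print(len(q), len(text))        # 4 3
print(text[1] == "q\u0308")     # True
print("".join(text).encode())   # b'aq\xcc\x88u'

# NFC has precomposed code points for ä and ö, so nothing is merged here.
finnish = unicodedata.normalize("NFD", "hyv\u00e4\u00e4 y\u00f6t\u00e4")
print(len(finnish), len(clusters(finnish)))     # 14 10
```

A list of str is of course not a str, which is why the proposal puts the
behaviour into str itself instead.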

We *could* also add a literal notation for canonical strings:

    >>> re.match(rc"[qq̈]x", c"q̈x")
    ...

Of course, str.canonical() could be expressed as:

    >>> len(unicodedata.normalize("Python-Canonical", q))
    3

but I think str.canonical() would deserve a place in the, well, canon.

Marko