Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Fri Jul 21 04:05:18 EDT 2017


On Fri, Jul 21, 2017 at 4:34 PM, Steve D'Aprano
<steve+python at pearwood.info> wrote:
> On Fri, 21 Jul 2017 01:43 pm, Chris Angelico wrote:
>
>> Strings with all code
>> points on the BMP and no combining characters can still be
>> represented as they are today, again with an empty secondary array.
>
> I presume that since the problem we're trying to solve here is that certain
> characters have two representations, this format will automatically decompose
> strings. Otherwise, it doesn't really solve the problems with diacritics, where
> a single human-readable character like é or ö has two distinct, and non-equal,
> representations.
>
> But if it does, then every string with a diacritic (i.e. most Western European
> text, if not Eastern European as well) will need combining characters.
>
> If this *doesn't* decompose the strings, then what problem is it actually
> solving?

I'm honestly not sure, though I had been assuming that it was capable
of representing composed OR decomposed strings. If it does decompose
everything, then yeah, a lot more strings will need the secondary array.
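
To make that concrete: é has a precomposed form (U+00E9) and a
decomposed form (U+0065 U+0301), and the two compare unequal until
normalized. A quick Python 3 demo, standard library only:

import unicodedata

composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True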

>> Similarly, the secondary array will only VERY rarely need to contain
>> any pointers; most combined characters consist of a base and one
>> combining character, or a set of three characters at most.
>
> I don't know if you can make that claim for non-Western-European languages. I don't
> know enough about (for example) Slavic languages, or Thai, or Arabic, or
> Chinese, to know whether (base + three combining characters) will be rare or
> not.

Not sure, but what I usually see is that one Chinese character gets
one Unicode code point. But again, forcible decomposition may change
this.
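
For what it's worth, that's easy to spot-check: common CJK ideographs
are single BMP code points with no canonical decomposition, so even
NFD leaves them alone.

import unicodedata

s = "\u6c49\u5b57"  # 汉字, "Chinese characters"
print(len(s))                                # 2 -- one code point each
print([hex(ord(c)) for c in s])              # ['0x6c49', '0x5b57']
print(unicodedata.normalize("NFD", s) == s)  # True: nothing decomposes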

> But emoji sequences will often require four code points, three of which will be
> in the supplementary planes.
>
> http://unicode.org/emoji/charts/emoji-zwj-sequences.html

"Often"? I doubt that; a lot of emoji don't require that many.

>> There'll be dramatic
>> performance costs for strings where piles of combining characters get
>> loaded on top of a single base, but at least they can be accurately
>> represented.
>
> They can be accurately represented right now. E.g. there is nothing ambiguous or
> inaccurate about U+1F469 U+1F3FD U+200D U+1F52C, "woman scientist with medium
> skin tone".

I may have elided a bit too much here. Let's start with a simpler
representation: a string is represented as a tuple of Python integer
objects, each of which uses the original scheme. Now, that's able to
represent everything, but it's stupidly expensive. The original
multi-tiered scheme gives vast improvements for everything other than
this case, but at least it doesn't make them unrepresentable (cf.
UCS-2).
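
As a sketch of that strawman (the helper names are mine, purely for
illustration):

# A string as a plain tuple of code points (Python ints). This can
# represent any code point sequence -- combining marks, astral
# characters, the lot -- but at a huge per-character cost.
def encode(s: str) -> tuple:
    return tuple(ord(c) for c in s)

def decode(t: tuple) -> str:
    return "".join(chr(cp) for cp in t)

s = "nai\u0308ve \U0001F469\U0001F3FD\u200D\U0001F52C"
assert decode(encode(s)) == s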

ChrisA


