Grapheme clusters, a.k.a. real characters

Steve D'Aprano steve+python at pearwood.info
Fri Jul 21 02:34:24 EDT 2017


On Fri, 21 Jul 2017 01:43 pm, Chris Angelico wrote:

> Strings with all code
> points on the BMP and no combining characters are still able to be
> represented as they are today, again with the empty secondary array.

I presume that since the problem we're trying to solve here is that certain
characters have two representations, this format will automatically decompose
strings. Otherwise, it doesn't really solve the problems with diacritics, where
a single human-readable character like é or ö has two distinct, and non-equal,
representations.
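
To make the "two representations" point concrete, here is a minimal sketch
using the standard unicodedata module. The variable names are mine, purely for
illustration:

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Same human-readable character, but the two strings are not equal.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Normalising both to the same form makes them compare equal.
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

So unless the format normalises on the way in, the two spellings of é stay
distinct and unequal.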

But if it does, then every string with a diacritic (i.e. most Western European
text, if not Eastern European as well) will need combining characters.

If this *doesn't* decompose the strings, then what problem is it actually
solving?


> The presence of a single combining character in the string does force
> it to be stored 32 bits per character, so there can be a price to pay.

Right -- so it's really compact for Americans, and blows out for just about
everyone else.
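
As a rough illustration of who would pay that price: the sketch below assumes
the format stores strings in decomposed (NFD) form, which is my reading and not
something the proposal states. needs_combining() is a hypothetical helper, not
part of any real API.

```python
import unicodedata

def needs_combining(text):
    """Return True if the NFD form of text contains a combining character.

    Hypothetical helper: assumes the proposed format would store strings
    decomposed, which the proposal does not actually say.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return any(unicodedata.combining(ch) for ch in decomposed)

for sample in ["hello world", "caf\u00e9", "na\u00efve", "Z\u00fcrich"]:
    print(ascii(sample), needs_combining(sample))
```

Under that assumption, only the pure ASCII sample avoids the 32-bit storage.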


> Similarly, the secondary array will only VERY rarely need to contain
> any pointers; most combined characters consist of a base and one
> combining, or a set of three characters at most. 

I don't know if you can make that claim for languages outside Western Europe. I
don't know enough about (for example) Slavic languages, or Thai, or Arabic, or
Chinese, to know whether (base + three combining characters) will be rare or
not.

But emoji sequences will often require four code points, three of which will be
in the supplementary planes.

http://unicode.org/emoji/charts/emoji-zwj-sequences.html
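
A quick way to check that count for any given sequence is sketched below.
plane_summary() is a name I made up for illustration, and the sample is the
"woman technologist, medium skin tone" sequence:

```python
def plane_summary(text):
    """Return (total code points, code points outside the BMP)."""
    supplementary = sum(1 for ch in text if ord(ch) > 0xFFFF)
    return len(text), supplementary

# WOMAN + medium skin tone modifier + ZERO WIDTH JOINER + PERSONAL COMPUTER
woman_technologist = "\U0001F469\U0001F3FD\u200D\U0001F4BB"
print(plane_summary(woman_technologist))   # (4, 3)
```

Only the ZERO WIDTH JOINER (U+200D) sits on the BMP; the other three code
points are in the supplementary planes.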


> There'll be dramatic 
> performance costs for strings where piles of combining characters get
> loaded on top of a single base, but at least they can be accurately
> represented.

They can be accurately represented right now. E.g. there is nothing ambiguous or
inaccurate about U+1F469 U+1F3FD U+200D U+1F52C, "woman scientist with medium
skin tone".
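
For what it's worth, here is a small sketch showing that today's str holds that
sequence exactly and round-trips it through the usual encodings. The names
printed depend on the Unicode database shipped with your Python, hence the
fallback value:

```python
import unicodedata

woman_scientist = "\U0001F469\U0001F3FD\u200D\U0001F52C"

# List each code point with its official name, if the database has one.
for ch in woman_scientist:
    print("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))

# Encoding and decoding gives back exactly the same four code points.
for encoding in ("utf-8", "utf-16", "utf-32"):
    assert woman_scientist.encode(encoding).decode(encoding) == woman_scientist
```

What str doesn't do is treat those four code points as one unit for len(),
indexing or slicing, but that's a different problem from accuracy.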




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



