Grapheme clusters, a.k.a. real characters

Gregory Ewing greg.ewing at canterbury.ac.nz
Wed Jul 19 18:12:09 EDT 2017


Chris Angelico wrote:
> * Strings with all codepoints < 256 are represented as they currently
> are (one byte per char). There are no combining characters in the
> first 256 codepoints anyway.
> * Strings with all codepoints < 65536 and no combining characters,
> ditto (two bytes per char).
> * Strings with any combining characters in them are stored in four
> bytes per char even if all codepoints are < 65536.
> * Any time a character consists of a single base with no combining, it
> is stored in UTF-32.
> * Combined characters are stored in the primary array as 0x80000000
> plus the index into a secondary array where these values are stored.
> * The secondary array has a pointer for each combined character
> (ignoring single-code-point characters), probably to a Python integer
> object for simplicity.

+1. We should totally do this just to troll the RUE!

-- 
Greg


