Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Thu Jul 20 23:43:39 EDT 2017


On Fri, Jul 21, 2017 at 1:20 PM, Steve D'Aprano
<steve+python at pearwood.info> wrote:
> On Fri, 21 Jul 2017 04:05 am, Marko Rauhamaa wrote:
>
>> If any string code point is 1114112 or greater
>
> By definition, no Unicode code point can ever have an ordinal value greater than
> 0x10FFFF = 1114111.
>
> So I don't know what you're talking about, but it isn't Unicode. If you want to
> invent your own text standard to compete with Unicode, well, good luck getting
> people behind it. Especially since it seems to offer nothing but a vast
> increase in complexity and memory usage for no apparent benefit that I can see.

He's talking specifically about an in-memory representation. The term
"code point" is inaccurate; "code unit" comes closer, but normally a
code unit represents *at most* one code point, and potentially less
(as with UTF-16, where an astral character needs two code units) -
whereas this scheme uses a single unit to represent *more* than one
code point. So maybe it needs a different word.
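
For instance, here's a quick Python illustration of the UTF-16 case:
an astral character occupies two code units, so each unit carries
*less* than a full code point:

    s = "\U0001F600"                # an astral (non-BMP) character
    units = s.encode("utf-16-le")
    print(len(units) // 2)          # -> 2 code units for 1 code point
    print([hex(int.from_bytes(units[i:i+2], "little")) for i in (0, 2)])
    # -> ['0xd83d', '0xde00'], a surrogate pair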

Nonetheless, this is a reasonably internally-consistent way to
represent textual data. It's a three-tier system (a rough code sketch
follows the list):

* The string is stored as a series of 32-bit integers, where each
integer is either a UTF-32 code unit or 1114112+n, with n indexing the
secondary array.
* That secondary array holds 64-bit integers: odd values pack up to
three 21-bit code points together, shifted left with the low bit set
as a flag; even values are pointers.
* The heap is tertiary storage for those combined characters
consisting of more than three code points (a base character plus more
than two combining characters).
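
Here's a minimal Python sketch of the scheme, just to pin down the
idea. The is_combining() helper and the exact packing details are my
own assumptions, not part of any real implementation:

    import unicodedata

    UNICODE_MAX = 0x10FFFF  # 1114111; primary values above this are indices

    def is_combining(cp):
        # Stand-in for a real property lookup; a nonzero canonical
        # combining class is a rough proxy for "combining character".
        return unicodedata.combining(chr(cp)) != 0

    def encode(text):
        """Split text into clusters and build the three tiers."""
        primary = []    # tier 1: 32-bit units (code point or UNICODE_MAX+1+n)
        secondary = []  # tier 2: 64-bit units (packed triple or heap index)
        heap = []       # tier 3: full code point lists for long clusters
        cluster = []

        def flush():
            if not cluster:
                return
            if len(cluster) == 1:
                primary.append(cluster[0])  # ordinary single character
            else:
                primary.append(UNICODE_MAX + 1 + len(secondary))
                if len(cluster) <= 3:
                    packed = 0
                    for cp in cluster + [0] * (3 - len(cluster)):
                        packed = (packed << 21) | cp  # trailing zeros = unused
                    secondary.append((packed << 1) | 1)  # odd: inline triple
                else:
                    heap.append(list(cluster))
                    secondary.append((len(heap) - 1) << 1)  # even: "pointer"
            cluster.clear()

        for ch in text:
            cp = ord(ch)
            if cluster and is_combining(cp):
                cluster.append(cp)  # extend the current cluster
            else:
                flush()             # start a new cluster
                cluster.append(cp)
        flush()
        return primary, secondary, heap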

And while this is extremely complicated, it does at least push most of
the complexity to the unusual cases. A pure ASCII string can still be
represented exactly the same way Python 3.3+ does it; all it needs is
a spare patch of memory representing an empty array of integers.
(Smarter people than I may be able to store this array in zero bytes
without getting things confused. I'm not sure.) Strings with all code
points on the BMP and no combining characters can likewise be
represented as they are today, again with an empty secondary array.
The presence of a single combining character does force the whole
string to be stored at 32 bits per character, so there can be a price
to pay. Similarly, the secondary array will only VERY rarely need to
contain any pointers; most combined characters consist of a base and
one combining character, or a set of three characters at most.
There'll be dramatic performance costs for strings where piles of
combining characters get loaded on top of a single base, but at least
they can be accurately represented.
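
For a sense of those common cases, here's what the hypothetical
encode() sketch above would produce (values illustrative):

    primary, secondary, heap = encode("cafe")
    # primary == [99, 97, 102, 101]; secondary == []; heap == []

    primary, secondary, heap = encode("cafe\u0301")  # 'e' + U+0301
    # primary == [99, 97, 102, 1114112]; secondary holds one packed
    # (odd) entry for the 'e'-plus-accent pair; heap stays empty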

However, there's still one major MAJOR problem. The semantics of
string handling depend on having a proper table of Unicode character
types (or at least a listing of every combining character). As the
Unicode standard is augmented, the number of combining characters can
increase, and I don't think the Consortium has pre-allocated space
saying "this is where combining characters will be placed". So what
should be done if there's an unknown code point in a string? Should an
exception be raised? Should the code point be assumed to be a base
character? Either way, you have versioning problems, and big ones.
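
The hazard is easy to demonstrate with today's unicodedata module; the
specific code point below is just one that happens to be unassigned in
current tables:

    import unicodedata

    cp = 0xE1234  # unassigned at the time of writing (illustrative)
    # combining() returns 0 for unassigned code points, which is
    # indistinguishable from a genuine base character like 'a':
    print(unicodedata.combining(chr(cp)))  # -> 0
    try:
        unicodedata.name(chr(cp))
    except ValueError:
        # Unassigned in this Python's database. Base character, or
        # error? Either answer can disagree with a future Unicode version.
        print("unassigned here")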

As such, this could be extremely useful as a tool for textual
analysis, but it should NOT, in my opinion, be the internal
representation of a string.

ChrisA


