Grapheme clusters, a.k.a. real characters

Chris Angelico rosuav at gmail.com
Wed Jul 19 11:02:29 EDT 2017


On Wed, Jul 19, 2017 at 11:42 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> Perhaps we don't have the same understanding of "constant time". Or
>> are you saying that you actually store and represent this as those
>> arbitrary-precision integers? Every character of every string has to
>> be a multiprecision integer?
>
> Yes, although feel free to optimize. The internal implementation isn't
> important but those "multiprecision" integers are part of an outward
> interface. So you could have:
>
>     >>> for c in Text("aq̈u \U0001F64B\U0001F3FF\u200D\u2642\uFE0F"):
>     ...     print(c)
>     ...
>     97
>     1895826184
>     117
>     32
>     5152920508016097895476141586773579
>
> (Note, though, that Python3 only has integers, there's no
> "multiprecision" about them.)

I said "multiprecision" because that's what the low-level
arbitrary-precision-integer library calls them - GNU Multiprecision
Integer [1].

Somehow you're going to have to store those in an indexable way, and
since you can't fit arbitrary-precision data into fixed-width slots,
O(1) addressing is going to require external storage. Basically,
you're representing a string as if it were a tuple of integers. That
makes fine sense semantically, but it's pretty costly:

>>> import sys
>>> sys.getsizeof("hello, world")
61
>>> sys.getsizeof(tuple("hello, world"))
144

That's just for the tuple itself; then you need to represent the
actual numbers. Each one will require an addressable memory
allocation, which basically means a minimum of 8 bytes (on my 64-bit
system; you'd save a bit on a 32-bit Python, but most people don't use
those now). So every character in your string requires an 8-byte
pointer plus an 8-byte value. In contrast, current CPython requires
*at most* four bytes per character, and for many strings, requires
only one byte per character - possible only because the data is kept
as an array, not as external references.
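
To put rough numbers on that (these are from a 64-bit CPython; the
exact figures vary by version, so take them as indicative):

>>> import sys
>>> sys.getsizeof(97)  # the smallest possible int object
28
>>> sys.getsizeof(5152920508016097895476141586773579)  # the combined emoji above
40

Each of those objects also needs a pointer from the containing tuple,
so outside CPython's small-int cache you're paying well over 30 bytes
per character before you even count allocator overhead.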

Now, this is a performance question, and it's not unreasonable to talk
about semantics first and let performance wait for later. But when you
consider how many ASCII-only strings Python uses internally (the names
of basically every global function and every attribute in every stdlib
module), and how you'll be enlarging those by a factor of 16 *and*
making every character lookup require two pointer reads, it's pretty
much a non-starter.

There MIGHT be something to be done using a sentinel value that
represents "the actual value is somewhere else". However, I'm not sure
how you could do that cleanly in a one-byte-per-char string other than
by maintaining some external table. So here's the best I can come up
with for efficiency - and it suffers horrendously from complexity (a
rough sketch in code follows the list):

* Strings with all codepoints < 256 are represented as they currently
are (one byte per char). There are no combining characters in the
first 256 codepoints anyway.
* Strings with all codepoints < 65536 and no combining characters,
ditto (two bytes per char).
* Strings with any combining characters in them are stored in four
bytes per char even if all codepoints are < 65536.
* Any time a character consists of a single base with no combining, it
is stored in UTF-32.
* Combined characters are stored in the primary array as 0x80000000
plus the index into a secondary array where these values are stored.
* The secondary array has a pointer for each combined character
(ignoring single-code-point characters), probably to a Python integer
object for simplicity.
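
Here's a rough Python model of that scheme - purely illustrative,
with made-up names, and with combined characters packed into a big
integer at 21 bits per code point (the exact packing wasn't
specified, so that part is an assumption):

from array import array

COMBINED_FLAG = 0x80000000

class HybridText:
    def __init__(self, graphemes):
        # graphemes: an iterable of grapheme clusters, each a str
        self.primary = array('I')  # compact 32-bit (typically) slots
        self.secondary = []        # big ints, combined characters only
        for g in graphemes:
            if len(g) == 1:
                # Single code point, no combining: store it directly,
                # UTF-32 style.
                self.primary.append(ord(g))
            else:
                # Combined character: pack the code points into one
                # arbitrary-precision int (21 bits each - an assumed
                # encoding) and store a flagged index into secondary.
                value = 0
                for cp in g:
                    value = (value << 21) | ord(cp)
                self.primary.append(COMBINED_FLAG | len(self.secondary))
                self.secondary.append(value)

    def __getitem__(self, i):
        slot = self.primary[i]
        if slot & COMBINED_FLAG:
            return self.secondary[slot ^ COMBINED_FLAG]
        return slot

>>> t = HybridText(["a", "q\u0308", "u"])
>>> t[0], t[2]
(97, 117)
>>> hex(t[1])
'0xe200308'

Character lookup stays O(1) - one array read plus at most one list
read - but every string operation (slicing, concatenation, searching)
now has to keep two parallel structures consistent, which is where
the complexity explodes.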

This scheme allows a maximum of two billion combined characters in any
string. Worst case, "a\u0303"*0x80000000 is a string of two billion
combined characters (four billion code points) that simply can't be
represented; but long before that, you'll run out of space to allocate
all those large integers. (Current CPython represents that string in
8GB of memory. Enough to push me into the swapper - I have only 16GB
in this system and a lot of it is in use - but nothing I can't
handle.) It also has the advantage that most strings won't change in
representation. However, the complexity is insane; I don't want to be
the one to write all the unit tests to make sure everything behaves as
advertised!
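
For the record, that 8GB figure is easy to verify: both code points
fit in CPython's two-bytes-per-code-point (PEP 393) representation,
so:

>>> 0x80000000 * 2 * 2 / 2**30  # chars * code points each * bytes each, in GiB
8.0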

Also, this system has the nasty implication that the creation of a new
combining character will fundamentally change the way a string
behaves. That means that running a slightly older version of Python
could potentially cause, not errors, but subtly different behaviour.
With Python 2.7, 3.4, 3.5, 3.6, and 3.7, I have four different major
Unicode versions, which means plenty of potential for newly-allocated
codepoints in newer Pythons. That's not usually a problem, as it only
affects a few things in the unicodedata module:

rosuav at sikorsky:~$ python3.4 -c "import unicodedata; print(unicodedata.name('\U0001f917'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: no such name
rosuav at sikorsky:~$ python3.5 -c "import unicodedata; print(unicodedata.name('\U0001f917'))"
HUGGING FACE

rosuav at sikorsky:~$ python3.6 -c "import unicodedata; print(unicodedata.name('\u1df6'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: no such name
rosuav at sikorsky:~$ python3.7 -c "import unicodedata; print(unicodedata.name('\u1df6'))"
COMBINING KAVYKA ABOVE RIGHT
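
The same drift shows up in the one property this whole scheme hinges
on. unicodedata.combining() returns 0 for a code point that isn't (or
isn't yet) a combining mark, and its nonzero combining class once the
tables catch up, so this one-liner gives different answers on either
side of the Unicode 10 boundary:

python3 -c "import unicodedata; print(unicodedata.combining('\u1df6'))"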

But if combining characters behave fundamentally differently to
others, there would be a change in string representation when U+1DF6
became a combining character. That's going to cause MASSIVE upheaval.
I don't think there's any solution to that, but if you can find one,
do please elaborate.

ChrisA

[1] https://gmplib.org/


