Grapheme clusters, a.k.a. real characters

Steven D'Aprano steve at pearwood.info
Thu Jul 20 01:15:20 EDT 2017


On Thu, 20 Jul 2017 12:40:08 +1000, Chris Angelico wrote:

> On Thu, Jul 20, 2017 at 12:12 PM, Steve D'Aprano
> <steve+python at pearwood.info> wrote:
>> On Thu, 20 Jul 2017 08:12 am, Gregory Ewing wrote:
>>
>>> Chris Angelico wrote:
>> [snip overly complex and complicated string implementation]
>>
>>
> An accurate description, but in my own defense, I had misunderstood
> Marko's idea. Actually, the implementation I detailed was far SIMPLER
> than I thought it would be; I started writing that post trying to prove
> that it was impossible, but it turns out it isn't actually impossible.
> Just highly impractical.

I haven't really been paying attention to Marko's suggestion in detail, 
but if we're talking about a whole new data type, how about a list of 
nodes, where each node's data is a decomposed string object guaranteed to 
be either:

* an array of single-character code points; or

* a single combining character sequence, emoji variation 
  sequence, or other grapheme;

together with its length in "characters".


So a string like "Hello World!" would be a single node:

(12, "Hello World!")
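
In code, each node might be as simple as a namedtuple (just a sketch; 
"count" and "string" are the field names I'll use below):

from collections import namedtuple

# count: length in graphemes; string: the underlying code points
Node = namedtuple("Node", "count string")

hello = Node(12, "Hello World!")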


Strings will always be in decomposed form, so a "café au lait" would 
automatically use:

U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT


regardless of your editor, and would be represented by three nodes:

(3, "caf") (1, "e\N{COMBINING ACUTE ACCENT}") (8, " au lait")
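
Building the nodes is mostly a matter of normalising to NFD and then 
grouping each base character with the combining marks that follow it. 
A simplified sketch, reusing the Node namedtuple from above (this 
handles combining marks only, not emoji variation sequences or the 
full grapheme cluster rules):

import unicodedata

def make_nodes(s):
    s = unicodedata.normalize("NFD", s)  # force decomposed form
    nodes = []
    plain = []  # current run of single-code-point characters
    i = 0
    while i < len(s):
        # extend j past any combining marks following s[i]
        j = i + 1
        while j < len(s) and unicodedata.combining(s[j]):
            j += 1
        if j - i == 1:
            plain.append(s[i])
        else:
            if plain:
                nodes.append(Node(len(plain), "".join(plain)))
                plain = []
            nodes.append(Node(1, s[i:j]))
        i = j
    if plain:
        nodes.append(Node(len(plain), "".join(plain)))
    return nodes

make_nodes("café au lait")
# -> [Node(count=3, string='caf'),
#     Node(count=1, string='e\N{COMBINING ACUTE ACCENT}'),
#     Node(count=8, string=' au lait')]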


Iterating the string would mean:

def iter_graphemes(nodes):
    for node in nodes:
        if node.count == 1:
            # the whole node is one grapheme: yield it intact
            yield node.string
        else:
            # a run of independent code points: yield them one at a time
            yield from node.string

Reversing would be:

def reverse_graphemes(nodes):
    for node in reversed(nodes):
        if node.count == 1:
            # never reverse inside a combining sequence
            yield node.string
        else:
            yield from node.string[::-1]

Getting the length in graphemes would mean:

sum(node.count for node in nodes)


Indexing and slicing I leave for an exercise.
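
(For the record, here's one way the indexing half of that exercise 
might go, ignoring negative indices: walk the nodes, subtracting each 
node's grapheme count until the index falls inside one.)

def grapheme_at(nodes, i):
    for node in nodes:
        if i < node.count:
            # count == 1 means the whole node is a single grapheme
            return node.string if node.count == 1 else node.string[i]
        i -= node.count
    raise IndexError("grapheme index out of range")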


We lose O(1) indexing and slicing, but the length could be cached. 
Calculate the length on demand, then cache it for next time. (This 
assumes the string is immutable.) Indexing and slicing would be 
proportional to the number of nodes, not the length of the string. So not 
as bad as a naive UTF-8 implementation.
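
The caching itself is trivial once the type is immutable; something 
along these lines:

class GraphemeString:
    def __init__(self, nodes):
        self._nodes = tuple(nodes)
        self._length = None  # computed on first use, then cached

    def __len__(self):
        if self._length is None:
            self._length = sum(node.count for node in self._nodes)
        return self._length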

Each substring could use the Flexible String Representation, to minimize 
the total memory. E.g. in the "café au lait" example above, the first and 
last node would use one byte per code point, and the middle node would 
use two bytes per code point.
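
You can see the effect with sys.getsizeof (the exact numbers vary by 
Python version and platform):

import sys

sys.getsizeof("caf")                          # all code points < 256: 1 byte each
sys.getsizeof("e\N{COMBINING ACUTE ACCENT}")  # BMP code point: 2 bytes each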

Of course, this doesn't *completely* solve the question of end user 
expectations. For example, many people would want "ij", or "\r\n", to 
count as a single character. And it would complicate the implementation 
of the various string methods and the regex engine. It would almost 
certainly be much slower than the str type, and use more memory, and it 
would be lossy with regards to certain patterns of code points.

For example, it wouldn't distinguish between composed and decomposed 
strings, since they're always normalised to decomposed form.
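
A quick demonstration of what's lost, using the stdlib's unicodedata:

import unicodedata

composed = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"
decomposed = "cafe\N{COMBINING ACUTE ACCENT}"
composed == decomposed                                # False
unicodedata.normalize("NFD", composed) == decomposed  # True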

But it might be worth doing, for applications that care about giving a 
better user experience when it comes to editing text.



-- 
Steve


