Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Fri Jul 14 14:10:38 EDT 2017


Steve D'Aprano <steve+python at pearwood.info>:

> On Fri, 14 Jul 2017 11:31 pm, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
>
> Sure it does. You want the human reader to be able to predict the
> number of graphemes ("characters") by sight. Okay, here's a string in
> UTF-8, in bytes:
>
> e288b4c39fcf89e289a0d096e280b0e282ac78e2889e
>
> How do you expect the human reader to predict the number of graphemes
> from a UTF-8 hex string?
>
> For the record, that's 44 hex digits or 22 bytes, to encode 9
> graphemes. That's an average of 2.44 bytes per grapheme. Would you
> expect the average programmer to be able to predict where the grapheme
> breaks are?
>
>> As it stands, we have
>> 
>>    è --[encode>-- Unicode --[reencode>-- UTF-8
>
> I can't even work out what you're trying to say here.

I can tell, yet that doesn't prevent you from dismissing what I'm
saying.
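
For concreteness, though, Steve's example does decode as advertised in
Python 3 (a quick sketch; in this particular string every code point
happens to be its own grapheme, so the counts line up):

    raw = bytes.fromhex("e288b4c39fcf89e289a0d096e280b0e282ac78e2889e")
    text = raw.decode("utf-8")
    print(len(raw))    # 22 bytes
    print(len(text))   # 9 code points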

>> Why is one encoding format better than the other?
>
> It depends on what you're trying to do.
>
> If you want to minimize storage and transmission costs, and don't care
> about random access into the string, then UTF-8 is likely the best
> encoding, since it uses as little as one byte per code point, and in
> practice with real-world text (at least for Europeans) it is rarely
> more expensive than the alternatives.

Python3's strings don't give me any better random access than UTF-8.

Storage and transmission costs are not an issue. It's only that storage
and transmission are still defined in terms of bytes. Python3's strings
force you to encode/decode between strings and bytes for a
yet-to-be-specified advantage.
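
A rough sketch of the round trip I mean (the file names are made up, and
I'm assuming plain binary file APIs):

    # Bytes come in from storage/transmission; Python 3 makes me decode
    # them into a str before I can treat them as text.
    with open("input.txt", "rb") as f:
        text = f.read().decode("utf-8")

    # On the way back out, the str has to be encoded into bytes again.
    with open("output.txt", "wb") as f:
        f.write(text.encode("utf-8"))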

> It also has the advantage of being backwards compatible with ASCII, so
> legacy applications that assume all characters are a single byte will
> work if you use UTF-8 and limit yourself to the ASCII-compatible
> subset of Unicode.

UTF-8 is perfectly backward-compatible with ASCII.
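
One line is enough to show it (sketch):

    # ASCII text is byte-for-byte identical under UTF-8.
    s = "plain ASCII text"
    assert s.encode("ascii") == s.encode("utf-8")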

> The disadvantage is that each code point can be one, two, three or
> four bytes wide, and naively shuffling bytes around will invariably
> give you invalid UTF-8 and cause data loss. So UTF-8 is not so good as
> the in-memory representation of text strings.

The in-memory representation is not an issue. It's the abstract
semantics that are the issue.
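
For the record, the byte-shuffling hazard Steve mentions looks like this
in practice (a sketch); to me it's a representation detail rather than a
semantic one:

    data = "è".encode("utf-8")       # b'\xc3\xa8' -- two bytes, one code point
    try:
        data[:1].decode("utf-8")     # naive byte slicing splits the sequence
    except UnicodeDecodeError as exc:
        print(exc)                   # unexpected end of data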

At the abstract level, we have the text in a human language. Neither
strings nor UTF-8 provide that so we have to settle for something
cruder. I have yet to hear why a string does a better job than UTF-8.
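
To make that concrete (a sketch; the decomposed è is just one example of
a grapheme that neither representation treats as a single unit):

    s = "e\u0300"                  # è written as e + combining grave accent
    print(s)                       # one grapheme on screen: è
    print(len(s))                  # 2 -- a str counts code points
    print(len(s.encode("utf-8")))  # 3 -- UTF-8 counts bytes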

> If you have lots of memory, then UTF-32 is the best for in-memory
> representation, because it's a fixed-width encoding and parsing it is
> simple. Every code point is just four bytes and you can easily
> implement random access into the string.

The in-memory representation is not an issue. It's the abstract
semantics that are the issue.

> If you want a reasonable compromise, UTF-16 is quite decent. If you're
> willing to limit yourself to the first 2**16 code points of Unicode,
> you can even pretend that it's a fixed-width encoding like UTF-32.

UTF-16 (used by Windows and Java, for example) is even worse than
strings and UTF-8 because:

    è --[encode>-- Unicode --[reencode>-- UTF-16 --[reencode>-- bytes
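
And even after the extra hop, the fixed-width promise only holds for the
first 2**16 code points (sketch):

    snake = "\U0001F40D"                   # a code point outside the BMP
    print(len(snake.encode("utf-16-le")))  # 4 -- a surrogate pair
    print(len("è".encode("utf-16-le")))    # 2 -- BMP characters stay 2 bytes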

> If you have to survive transmission through machines that require
> 7-bit clean bytes, then UTF-7 is the best encoding to use.

I don't know why that is coming into this discussion.

So no raison d'être has yet been offered for strings.


Marko


