Grapheme clusters, a.k.a. real characters

Steve D'Aprano steve+python at pearwood.info
Fri Jul 14 22:33:02 EDT 2017


On Sat, 15 Jul 2017 04:10 am, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+python at pearwood.info>:
>> On Fri, 14 Jul 2017 11:31 pm, Marko Rauhamaa wrote:
[...]
>>> As it stands, we have
>>> 
>>>    è --[encode>-- Unicode --[reencode>-- UTF-8
>>
>> I can't even work out what you're trying to say here.
> 
> I can tell, yet that doesn't prevent you from dismissing what I'm
> saying.

How am I dismissing it? I didn't reply to it except to say I don't understand
it! To me, it looks like gibberish, not even wrong, but rather than say so I
thought I'd give you the opportunity to explain what you meant.

As the person attempting to communicate, any failure to do so is *your*
responsibility, not that of the reader. If you are discussing this in good
faith, rather than as a cheap points-scoring exercise, then please try to
explain what you mean.


>>> Why is one encoding format better than the other?
>>
>> It depends on what you're trying to do.
>>
>> If you want to minimize storage and transmission costs, and don't care
>> about random access into the string, then UTF-8 is likely the best
>> encoding, since it uses as little as one byte per code point, and in
>> practice with real-world text (at least for Europeans) it is rarely
>> more expensive than the alternatives.
> 
> Python3's strings don't give me any better random access than UTF-8.

Say what? Of course they do.

Python 3 strings (since 3.3) are a compact form of UTF-32. Without loss of
generality, we can say that each string is an array of four-byte code units.

(In practice, depending on the string, Python may be able to compact that to
one- or two-byte code units.)

The critical thing is that slicing and indexing are constant-time operations:
string[i] can jump straight to the i-th code unit in the array. If the
code units are four bytes wide, that's an offset of just 4*i bytes.

UTF-8 is not: it is a variable-width encoding, so there's no way to tell how
many bytes it takes to get to string[i]. You have to start at the beginning of
the string and walk the bytes, counting code points, until you reach the i-th
code point.
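
To make that concrete, here's a rough sketch (utf8_index is my own throwaway
helper, not anything from the stdlib): indexing a str jumps straight to the
i-th code point, while finding the i-th code point in UTF-8 bytes means
walking the data from the start and counting lead bytes.

    s = "αβγδε"
    print(s[3])    # 'δ' -- constant time, however long the string is

    def utf8_index(data, i):
        """Return the i-th code point of UTF-8 encoded data (linear scan)."""
        count = -1
        for pos, byte in enumerate(data):
            if byte & 0b11000000 != 0b10000000:   # a lead byte, not a continuation
                count += 1
                if count == i:
                    end = pos + 1
                    while end < len(data) and data[end] & 0b11000000 == 0b10000000:
                        end += 1
                    return data[pos:end].decode('utf-8')
        raise IndexError(i)

    print(utf8_index(s.encode('utf-8'), 3))   # 'δ' again, but O(n) to find it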

It may be possible to trade memory for time by building an augmented data
structure that makes this easier. A naive example would be to have a separate
array giving the byte offset of each code point. But then it's not a string any
more, it's a more complex data structure.
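
A quick sketch of that naive approach, assuming UTF-8 bytes as the underlying
storage (build_offsets and char_at are made-up helpers):

    def build_offsets(data):
        # byte offset of every lead byte, i.e. of every code point
        return [pos for pos, byte in enumerate(data)
                if byte & 0b11000000 != 0b10000000]

    data = "naïve café".encode('utf-8')
    offsets = build_offsets(data)

    def char_at(data, offsets, i):
        end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
        return data[offsets[i]:end].decode('utf-8')

    print(char_at(data, offsets, 2))   # 'ï', found in constant time

Constant-time indexing again, but now you're paying for the offsets list on
top of the bytes, and you have to rebuild it whenever the text changes.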

Go ignores this problem by simply not offering random access to code points in
strings. Go simply says that strings are bytes, and if string[i] jumps into the
middle of a character (code point), oh well, too bad, so sad.

On the other hand, Go also offers a second solution to the problem. It's
essentially the same solution that Python offers: a dedicated fixed-width,
32-bit (four-byte) code point type which they call a "rune", so that a slice
of runes gives you constant-time indexing again.


> Storage and transmission costs are not an issue.

I was giving a generic answer to a generic question. You asked a general
question, "Why is one encoding format better than the other?" and the general
answer to that is *it depends on what you are trying to do*.


> It's only that storage and transmission are still defined in terms of bytes.

Again, I don't see what point you think you are making here. Ultimately, all our
data structures have to be implemented in memory which is addressable in bytes.
*All of them* -- objects, linked lists, floats, BigInts, associative arrays,
red-black trees, the lot.

All of those data structures are presented to the programmer in terms of higher
level abstractions. You seem to think that text strings alone don't need that
higher level abstraction, and that the programmer ought to think about text in
terms of bytes. Why?

You entered this discussion with a reasonable position: the text primitives
offered to programmers fall short of what we'd like, which is to deal with
language in terms of language units: characters specifically. (Let's assume we
can decide what a character actually is.) I agree! If Python's text strings are
supposed to be an abstraction for "strings of characters", it's a leaky
abstraction. It's actually "strings of code points".
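
To make the leak concrete (my own example, using a combining accent):

    s = "e\u0300"          # 'e' followed by COMBINING GRAVE ACCENT, displays as 'è'
    print(s)               # è
    print(len(s))          # 2 -- Python counts code points, not characters
    print(len("\u00e8"))   # 1 -- the precomposed è is a single code point

    import unicodedata
    print(len(unicodedata.normalize("NFC", s)))   # 1 after normalisation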

Some people might have said:

"Since Python strings fall short of the abstraction we would like, we should
build a better abstraction on top of it, using Unicode primitives, that deals
with characters (once we decide what they are)."

which is where I thought you were going with this. But instead, you've suggested
that the solution to the problem:

"Python strings don't come close enough to matching the programmer's
expectations about characters"

is to move *further away* from the programmer's expectations about characters
and to have them reason about UTF-8 encoded bytes instead.

And then, to insult our intelligence even further, after raising the in-memory
representation (UTF-8 versus some other encoding) to prominence, you
repeatedly said that the in-memory representation doesn't matter!

If it doesn't matter, why do you care whether strings use UTF-8 or UTF-32 or
something else?



> Python3's strings 
> force you to encode/decode between strings and bytes for a
> yet-to-be-specified advantage.

That's simply wrong. You are never forced to encode/decode if you are dealing
with strings alone, or bytes alone. You only need to encode/decode when
converting between the two.

You don't even need to explicitly decode when dealing with file I/O. Provided
your files are correctly encoded, Python abstracts away the need to decode and
you can just read text out of a file. So your statement is wrong.
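
For example (a minimal sketch, assuming a UTF-8 encoded file at a made-up path):

    with open("example.txt", "w", encoding="utf-8") as f:
        f.write("naïve café\n")       # str goes in; the UTF-8 bytes are written for you

    with open("example.txt", encoding="utf-8") as f:
        text = f.read()               # bytes come off disk; you get str back

    print(type(text))                 # <class 'str'> -- no explicit decode anywhere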


>> It also has the advantage of being backwards compatible with ASCII, so
>> legacy applications that assume all characters are a single byte will
>> work if you use UTF-8 and limit yourself to the ASCII-compatible
>> subset of Unicode.
> 
> UTF-8 is perfectly backward-compatible with ASCII.

No, it isn't. ASCII is a 7-bit encoding. No valid ASCII data has the 8th bit set.
UTF-8 uses all eight bits; for example, π in UTF-8 takes two bytes:

\xcf\x80

in hex, which are:

0b11001111 0b10000000

in binary. As you can see, the eighth bit is set in both of those bytes.

UTF-8 is only backwards compatible with ASCII if you limit yourself to the ASCII
subset of Unicode, i.e. the 128 values between U+0000 and U+007F.
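
You can check this at the interactive prompt:

    data = "π".encode("utf-8")
    print(data)                         # b'\xcf\x80'
    print([bin(b) for b in data])       # ['0b11001111', '0b10000000']
    print(all(b < 128 for b in data))   # False -- not valid ASCII

    # but pure-ASCII text round-trips through UTF-8 unchanged
    print("hello".encode("utf-8") == "hello".encode("ascii"))   # True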


>> The disadvantage is that each code point can be one, two, three or
>> four bytes wide, and naively shuffling bytes around will invariably
>> give you invalid UTF-8 and cause data loss. So UTF-8 is not so good as
>> the in-memory representation of text strings.
> 
> The in-memory representation is not an issue. It's the abstract
> semantics that are the issue.

What?

You're asking about *encodings*. By definition, that means you're talking about
the in-memory representation.

Dear gods man, this is like you asking "Which makes for a better car, gasoline,
diesel, LPG, electric or hydrogen?" and then when I start to discuss the
differences between the fuels you say "I don't care about the internal
differences of the engines, I only care about controls on the dashboard".

Marko, it is times like this I think you are trolling, and come really close to
just kill-filing you. You explicitly asked about encodings, so I answered your
question about encodings. For you to now say that the encoding is irrelevant,
well, just stop wasting my time.

I don't think you are discussing this in good faith. I think you are arguing to
win, no matter how incoherent your argument becomes, so long as you "win" for
some definition of winning. I don't have infinite patience for that sort of
behaviour.


> At the abstract level, we have the text in a human language. Neither
> strings nor UTF-8 provide that so we have to settle for something
> cruder. I have yet to hear why a string does a better job than UTF-8.

This is not even wrong.

You are comparing a data structure, string, with a mapping, UTF-8. They aren't
alternatives that we get to choose between, like "strings versus ropes"
or "UTF-8 versus ISO-8859-3". They are *complementary* not alternatives: 

we can have strings of UTF-8 encoding text, or strings of ISO-8859-3 bytes, or
ropes of UTF-8 encoded text, or ropes of ISO-8859-3 bytes.
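
A rough way to see the distinction in code: bytes is the data structure, and
the encoding is the mapping applied to the text it stores.

    text = "ĉapelo"                          # Esperanto, so ISO-8859-3 can hold it
    as_utf8 = text.encode("utf-8")
    as_latin3 = text.encode("iso-8859-3")
    print(type(as_utf8) is type(as_latin3))  # True -- both are plain bytes objects
    print(len(as_utf8), len(as_latin3))      # 7 6 -- different mappings, same structure
    print(as_utf8.decode("utf-8") == as_latin3.decode("iso-8859-3"))  # True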

To give an analogy, you're saying "I have yet to hear why cars do a better job
than electric motors."



> UTF-16 (used by Windows and Java, for example) is even worse than
> strings and UTF-8 because:
> 
>     è --[encode>-- Unicode --[reencode>-- UTF-16 --[reencode>-- bytes

Taken at face value, this doesn't make sense. It's just gibberish.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



