Grapheme clusters, a.k.a. real characters

Steve D'Aprano steve+python at pearwood.info
Fri Jul 14 12:50:33 EDT 2017


On Fri, 14 Jul 2017 11:31 pm, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+python at pearwood.info>:
> 
>> These are only a *few* of the *easy* questions that need to be
>> answered before we can even consider your question:
>>
>>> So the question is, should we have a third type for text. Or should
>>> the semantics of strings be changed to be based on characters?
> 
> Sure, but if they can't be answered, what good is there in having
> strings (as opposed to bytes). 

I didn't say they can't be answered. But however you answer them, you're going
to make somebody angry.

I notice you haven't given a definition for "character" yet. It's easy to be
critical and complain that Unicode strings don't handle "characters", but if
you can't suggest any improvements, then you're just bellyaching.

Do you have some concrete improvements in mind?


> What problem do strings solve?

Well, to start with it's a lot nicer to be able to write:


    name = input("What is your name?")

instead of:

    name = input("5768617420697320796f7572206e616d653f")

don't you think? I think that alone makes strings worth it.

And of course, I don't want to be limited to just US English, or one language at
a time. So we need a universal character set.
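(If you doubt that the hex blob above really says the same thing, a quick check
at the interactive interpreter, roughly:)

    >>> bytes.fromhex("5768617420697320796f7572206e616d653f").decode("ascii")
    'What is your name?'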


> What  
> operation depends on (or is made simpler) by having strings (instead of
> bytes)?

Code is written for people first, and to be executed by a computer only second.
So we want human-readable text to look as much like human-readable text as
possible.

Although I suppose computer keyboards would be a lot smaller if they only needed
16 keys marked 0...9ABCDEF instead of what we have now. We could program by
entering bytes:

6e616d65203d20696e70757428225768617420697320796f7572206e616d653f22290a7072696e742822596f7572206e616d652069732025722e222025206e616d6529

although debugging would be a tad more difficult, I expect. But the advantage
is, we'd have one less data type!

I mean, sure, *some* stick-in-the-mud old-fashioned programmers would prefer to
write:

name = input("What is your name?")
print("Your name is %r." % name)

but I think your suggestion of eliminating strings and treating everything as
bytes has its advantages. For starters, everything is a one-liner!

Bytes, being a sequence of numbers, shouldn't define text operations like
converting to uppercase, regular expressions, and so forth. Of course the
Python 3 bytes data type does support some limited text operations, but that's
for backward compatibility with pre-Unicode Python, and it's limited to ASCII.
If we were designing Python from scratch, I'd argue strongly against adding
text methods to a sequence of numbers.
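To see just how limited those bytes methods are, compare (a rough sketch at the
interpreter):

    >>> b"caf\xc3\xa9 latte".upper()   # the UTF-8 bytes for "café latte"
    b'CAF\xc3\xa9 LATTE'               # only the ASCII letters change case
    >>> "café latte".upper()
    'CAFÉ LATTE'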


> We are not even talking about some exotic languages, but the problem is
> right there in the middle of Latin-1. We can't even say what
>
>     len("è")
> 
> should return.

Latin-1 predates Unicode, so this problem has existed for a long time. It's not
something that Unicode has introduced, it is inherent to the problem of dealing
with human language in its full generality.

Do you have a solution for this? How do you get WYSIWYG display of text without
violating the expectation that we should be able to count the length of a
string?

Before you answer, does your answer apply to Arabic and Thai as well as Western
European languages?


> And we may experience: 
> 
>     >>> ord("è")
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     TypeError: ord() expected a character, but string of length 2 found

You might, but only as a contrived example. You had to intentionally create a
decomposed string of length two as a string literal, and then call ord(). But
of course you knew that was going to happen -- it's not something likely to
happen by accident. In practice, when you receive an arbitrary string, you test
its length before calling ord(). Or you walk the string calling ord() on each
code point.
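And if you do need to deal with decomposed strings, the standard library
already gives you the tools; something along these lines (an untested sketch):

    >>> import unicodedata
    >>> s = "e\N{COMBINING GRAVE ACCENT}"     # a decomposed "è": two code points
    >>> len(s)
    2
    >>> [hex(ord(c)) for c in s]              # walk the string, code point by code point
    ['0x65', '0x300']
    >>> len(unicodedata.normalize("NFC", s))  # or compose it first, then count
    1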


> Of course, UTF-8 in a bytes object doesn't make the situation any
> better, but does it make it any worse?

Sure it does. You want the human reader to be able to predict the number of
graphemes ("characters") by sight. Okay, here's a string in UTF-8, in bytes:

e288b4c39fcf89e289a0d096e280b0e282ac78e2889e

How do you expect the human reader to predict the number of graphemes from a
UTF-8 hex string?

For the record, that's 44 hex digits or 22 bytes, to encode 9 graphemes. That's
an average of 2.44 bytes per grapheme. Would you expect the average programmer
to be able to predict where the grapheme breaks are?
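The only practical way to find out is to decode it and look; roughly:

    >>> text = bytes.fromhex("e288b4c39fcf89e289a0d096e280b0e282ac78e2889e").decode("utf-8")
    >>> text
    '∴ßω≠Ж‰€x∞'
    >>> len(text)   # nine code points, each of which happens to be a single grapheme here
    9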


> As it stands, we have
> 
>    è --[encode>-- Unicode --[reencode>-- UTF-8

I can't even work out what you're trying to say here.



> Why is one encoding format better than the other?

It depends on what you're trying to do.

If you want to minimize storage and transmission costs, and don't care about
random access into the string, then UTF-8 is likely the best encoding, since it
uses as little as one byte per code point, and in practice with real-world text
(at least for Europeans) it is rarely more expensive than the alternatives.

It also has the advantage of being backwards compatible with ASCII, so legacy
applications that assume all characters are a single byte will work if you use
UTF-8 and limit yourself to the ASCII-compatible subset of Unicode.
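A quick demonstration of that compatibility (pure ASCII text gives identical
bytes either way):

    >>> "What is your name?".encode("ascii") == "What is your name?".encode("utf-8")
    True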

The disadvantage is that each code point can be one, two, three or four bytes
wide, and naively shuffling bytes around will invariably give you invalid UTF-8
and cause data loss. So UTF-8 is not so good as the in-memory representation of
text strings.
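For example (the exact error message may vary, but roughly):

    >>> b = "è∞".encode("utf-8")   # two characters, five bytes
    >>> b[:3].decode("utf-8")      # slicing by byte count cuts the '∞' in half
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 2: unexpected end of data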

If you have lots of memory, then UTF-32 is the best for in-memory
representation, because it's a fixed-width encoding and parsing it is simple.
Every code point is just four bytes and you can easily implement random access
into the string.
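A sketch of what that buys you, using the same nine-character string from
above:

    >>> s = "∴ßω≠Ж‰€x∞"
    >>> b = s.encode("utf-32-le")        # "-le" avoids the BOM that plain "utf-32" prepends
    >>> len(b)                           # exactly four bytes per code point
    36
    >>> b[4*6:4*7].decode("utf-32-le")   # random access: the code point at index 6
    '€'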

If you want a reasonable compromise, UTF-16 is quite decent. If you're willing
to limit yourself to the first 2**16 code points of Unicode, you can even
pretend that it's a fixed-width encoding like UTF-32.
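For instance (MUSICAL SYMBOL G CLEF lives outside the first 2**16 code points,
so it needs a surrogate pair):

    >>> len("€".encode("utf-16-le"))   # U+20AC is in the first 2**16: two bytes
    2
    >>> len("𝄞".encode("utf-16-le"))   # U+1D11E is not: four bytes
    4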

If you have to survive transmission through machines that require 7-bit clean
bytes, then UTF-7 is the best encoding to use.
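Python ships a utf-7 codec, so you can check for yourself; roughly:

    >>> "è".encode("utf-7")                 # non-ASCII becomes a base64-style escape
    b'+AOg-'
    >>> max("è∞€".encode("utf-7")) < 128    # every byte stays in the 7-bit range
    True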


As for the legacy encodings:

- they're not 7-bit clean, except for ASCII;

- some of them are variable-width;

- none of them support the full range of Unicode, so they aren't universal
character sets (see the sketch after this list);

- in other words, you either resign yourself to being unable to exchange
documents with other people, resign yourself to dealing with mojibake, or
invent some complex and non-backwards-compatible in-band mechanism for
switching charsets;

- they suffer from the exact same problems as Unicode regarding the distinction
between code points and graphemes;

- so not only do they lack the advantages of Unicode, but they have even more
disadvantages.
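To make the "not universal" point concrete, here's the nine-character string
from earlier meeting one of the more capable legacy charsets (roughly what
you'd see):

    >>> "∴ßω≠Ж‰€x∞".encode("latin-1")
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u2234' in position 0: ordinal not in range(256)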



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



