Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Fri Jul 14 09:31:33 EDT 2017


Steve D'Aprano <steve+python at pearwood.info>:

> These are only a *few* of the *easy* questions that need to be
> answered before we can even consider your question:
>
>> So the question is, should we have a third type for text. Or should
>> the semantics of strings be changed to be based on characters?

Sure, but if they can't be answered, what good is there in having
strings (as opposed to bytes)? What problem do strings solve? What
operation depends on (or is made simpler by) having strings (instead of
bytes)?

We are not even talking about exotic languages; the problem is
right there in the middle of Latin-1. We can't even say what

    len("è")

should return. And we may experience:

    >>> ord("è")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: ord() expected a character, but string of length 2 found
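
For illustration, here is a minimal sketch of how both results can arise. It assumes the "è" in question was entered in decomposed form (U+0065 followed by U+0300 COMBINING GRAVE ACCENT) rather than as the precomposed U+00E8; the variable names are only illustrative.

```python
import unicodedata

composed = "\u00e8"      # è as one precomposed code point (part of Latin-1)
decomposed = "e\u0300"   # e followed by COMBINING GRAVE ACCENT

# Both spellings render as the same character but are different strings.
len_composed = len(composed)      # 1
len_decomposed = len(decomposed)  # 2
same = composed == decomposed     # False

# ord() accepts only a string of exactly one code point.
try:
    ord(decomposed)
    ord_error = None
except TypeError as exc:
    ord_error = str(exc)

# NFC normalization folds the pair into the precomposed code point.
normalized = unicodedata.normalize("NFC", decomposed)
```

After normalization, `normalized == composed` holds, but that only helps for characters that happen to have a precomposed form.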

Of course, UTF-8 in a bytes object doesn't make the situation any
better, but does it make it any worse?
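
As a rough comparison, the sketch below encodes the two common spellings of "è" (precomposed U+00E8, and "e" plus U+0300) to UTF-8. Which spelling a given input actually uses is an assumption here; the point is only that the length ambiguity carries over to bytes.

```python
composed = "\u00e8"      # precomposed è
decomposed = "e\u0300"   # e + COMBINING GRAVE ACCENT

composed_bytes = composed.encode("utf-8")      # b'\xc3\xa8'
decomposed_bytes = decomposed.encode("utf-8")  # b'e\xcc\x80'

# Code-point lengths versus byte lengths for the same visible character.
str_lengths = (len(composed), len(decomposed))               # (1, 2)
byte_lengths = (len(composed_bytes), len(decomposed_bytes))  # (2, 3)
```

Neither the str lengths nor the bytes lengths equal the single character a reader sees.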

As it stands, we have

   è --[encode>-- Unicode --[reencode>-- UTF-8

Why is one encoding format better than the other?


Marko


