Grapheme clusters, a.k.a. real characters

Fri Jul 14 03:40:06 EDT 2017

On Fri, Jul 14, 2017 at 4:30 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Unicode was supposed to get us out of the 8-bit locale hole. Now it
> seems the Unicode hole is far deeper and we haven't reached the bottom
> of it yet. I wonder if the hole even has a bottom.
>
> We now have:
>
>  - an encoding: a sequence of bytes
>
>  - a string: a sequence of integers (code points)
>
>  - "a snippet of text": a sequence of characters

Before Unicode, we had exactly the same thing, only with more encodings.

> Assuming "a sequence of characters" is the final word, and Python wants
> to be involved in that business, one must question the usefulness of
> strings, which are neither here nor there.
>
> When people use Unicode, they are expecting to be able to deal in real
> characters. I would expect:
>
>    len(text)               to give me the length in characters
>    text[-1]                to evaluate to the last character
>    re.match("a.c", text)   to match a character between a and c
>
> So the question is, should we have a third type for text. Or should the
> semantics of strings be changed to be based on characters?

What is the length of a string? How often do you actually care about
the number of grapheme clusters - and not, for example, about the
pixel width? (To columnate text, for instance, you need to know about
its width in pixels or millimeters, not the number of characters in
the line.) And if you're going to group code points together because
some of them are combining characters, would you also group them
together because there's a zero-width joiner in the middle? The answer
will sometimes be "yes of course" and sometimes "of course not". These
kinds of linguistic considerations shouldn't be codified into the core
of the language.
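
To make the ambiguity concrete, here is a small illustration of what len() reports today. The sample strings are my own choices, not taken from the thread: an "e" followed by a combining acute accent, its NFC-normalized form, and an emoji sequence glued together with zero-width joiners.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character
decomposed = "e\u0301"

# NFC normalization folds the pair into the single code point U+00E9
composed = unicodedata.normalize("NFC", decomposed)

# MAN, ZWJ, WOMAN, ZWJ, GIRL: often rendered as a single family glyph
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(len(decomposed))         # 2
print(len(composed))           # 1
print(len(family))             # 5
print(decomposed == composed)  # False
```

Whether the "right" length of the last string is 1, 3 or 5 depends on what the caller is doing with it, which is the point made above.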

IMO the Python str type is adequate as a core data type. What we may
need, though, are additional utility functions, e.g.:

* unicodedata.grapheme_clusters(str) - split str into a sequence of
grapheme clusters
* pango.get_text_extents(str) - measure the pixel dimensions of a line of text
* platform.punish_user() - issue a platform-dependent response (such
as an electric shock, a whack with a 2x4, or a dropped anvil) on
someone who has just misunderstood Unicode again
* socket.punish_user() - as above, but to the user at the opposite end
of a socket
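
As a rough idea of what the first of these might look like, here is a deliberately simplified sketch. It is a hypothetical standalone helper named after the suggestion above, and it is not the full Unicode segmentation algorithm (UAX #29): it only attaches code points with a nonzero canonical combining class to the preceding one, and optionally glues together sequences around a zero-width joiner. The join_zwj parameter is my own assumption, added to show that the grouping policy is a choice the caller has to make. Hangul syllables, regional indicators, variation selectors and emoji modifiers are all ignored.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def grapheme_clusters(text, join_zwj=True):
    """Split text into approximate grapheme clusters (simplified sketch)."""
    clusters = []
    for ch in text:
        attach = False
        if clusters:
            if unicodedata.combining(ch) != 0:
                # Combining mark: belongs to the preceding cluster
                attach = True
            elif join_zwj and (ch == ZWJ or clusters[-1].endswith(ZWJ)):
                # A joiner, or the code point right after one
                attach = True
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(grapheme_clusters("e\u0301x"))                # ['é', 'x'], the first item being two code points
print(len(grapheme_clusters(family)))                # 1
print(len(grapheme_clusters(family, join_zwj=False)))  # 5
```

A real implementation would follow the grapheme cluster boundary rules from the Unicode standard and would need updating as those rules change, which is another argument for keeping it in a utility function rather than in the str type itself.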

ChrisA