Grapheme clusters, a.k.a. real characters

Steve D'Aprano steve+python at pearwood.info
Fri Jul 14 08:07:03 EDT 2017


On Fri, 14 Jul 2017 04:30 pm, Marko Rauhamaa wrote:

> Unicode was supposed to get us out of the 8-bit locale hole.

Which it has done. Apart from use for backwards compatibility, there is no good
reason to use the masses of legacy extensions to ASCII or the technically
fragile non-Unicode multibyte encodings from China and Japan.

Backwards compatibility is important, but for new content we should all support
Unicode.


> Now it 
> seems the Unicode hole is far deeper and we haven't reached the bottom
> of it yet. I wonder if the hole even has a bottom.

This is not a Unicode hole. This is a human languages hole, compounded by the
need for backwards compatibility with legacy encodings.


> We now have:
> 
>  - an encoding: a sequence a bytes
>
>  - a string: a sequence of integers (code points)
> 
>  - "a snippet of text": a sequence of characters

I'm afraid that's wrong, and much too simplified. What we have had, ever since
computers started having standards for the storage and representation of text
(i.e. since EBCDIC at the very least, possibly even earlier), is:

(1) A **character set** made up of some collection of:

    - alphabetical letters, characters, syllabograms, ideographs or logographs
    - digits and other numeric symbols
    - punctuation marks
    - other textual marks, including diacritics ("accent marks")
    - assorted symbols, icons, pictograms or hieroglyphics
    - control and formatting codes
    - white space and other text separators
    - and any other entities that have text-like semantics.

The character set is the collection of entities we would like to represent as
computer data. But of course computers can't store "the letter Aye" A or "the
letter Zhe" Ж, so we also need:

(2) A (possibly implicit) mapping between the entities in the character 
    set and some contiguous range of abstract numeric values ("code points").

(3) The **encoding**, an explicit mapping between those abstract code points
    and some concrete representation suitable for use as storage or transmission
    by computers. That usually means a sequence of "code units", where
    each code unit is typically one, two or four bytes.

Note that a single character set could have multiple encodings.
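
For instance, here is a single code point from the Unicode character set under
three of Unicode's own encodings (a quick interactive illustration; Python just
happens to be the language at hand):

    >>> ord('Ж')                   # the abstract code point, U+0416
    1046
    >>> 'Ж'.encode('utf-8')        # one concrete byte sequence...
    b'\xd0\x96'
    >>> 'Ж'.encode('utf-16-be')    # ...another...
    b'\x04\x16'
    >>> 'Ж'.encode('utf-32-be')    # ...and another, all for the same code point
    b'\x00\x00\x04\x16'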

In pre-Unicode encodings such as ASCII, the difference between (1) and (2) was
frequently (always?) glossed over. For example, in ASCII:

- the character set was made up of 128 control characters, American English
  letters, digits and punctuation marks;

- there is an implicit mapping from characters to code points: say, "character
  A is code point 65";

- there is also an explicit mapping from code points to bytes: "character A
  (i.e. code point 65) is byte 0x41 (decimal 65)".

So the legacy character set and encoding standards helped cause confusion, by
implying that "characters are bytes" instead of making the difference explicit.
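
You can see how easy the conflation was: in ASCII (and later Latin-1) the code
point and the encoded byte share the same numeric value, and the identity only
breaks down once you step outside those repertoires. A quick demonstration:

    >>> ord('A'), 'A'.encode('ascii')     # code point 65, byte 0x41 == 65
    (65, b'A')
    >>> ord('é'), 'é'.encode('latin-1')   # still one byte with the same value
    (233, b'\xe9')
    >>> ord('é'), 'é'.encode('utf-8')     # but here the identity is gone
    (233, b'\xc3\xa9')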

In addition, we have:

(4) Strings, ropes and other data structures suitable for the storage of
    **sequences of code points** (characters, codes, symbols etc); strings
    being the simplest implementation (a simple array of code units), but
    they're not the only one.

We also have:

(5) Human-meaningful chunks of text: characters, graphemes, words, sentences, 
    symbols, paragraphs, pages, sections, chapters, snippets or what have you.


There's no direct one-to-one correspondence between (5) and (4). A string can
just as easily contain half a word "aard" as a full word "aardvark".

And let's not forget:

(6) The **glyphs** of each letter, symbol, etc, encompassing the visual shape
    and design of those chunks of text, which can depend on the context. For
    example, the Greek letter sigma looks different depending on whether it
    is at the end of a word or not.


> Assuming "a sequence of characters" is the final word, 

Why would you assume that? Let's start with, what's a character?


> and Python wants 
> to be involved in that business, one must question the usefulness of
> strings, which are neither here nor there.

Sure, you can question anything you like, it's a free country[1], but unless you
have a concrete plan for something better and are willing to implement it, the
chances are very high that nothing will happen.

The vast majority of programming languages provide only a set of low-level
primitives for manipulating strings, with no semantic meaning enforced. If you
want to give *human meaning* to your strings, you need something more than just
the string-handling primitives your computer language provides. This was just
as true in the old days of ASCII as it is today with Unicode: your computer
language is just as happy making a string containing the nonsense word 
"vxtsEpdlu" as the real word "Norwegian".


> When people use Unicode, they are expecting to be able to deal in real
> characters.

Then their expectations are too high and they are misinformed. Unicode is not a
standard for implementing human-meaningful text (although it takes a few steps
towards such a standard).

Unicode doesn't even have a concept of "character". Indeed, as I hinted above by
asking you what is a character, such a concept isn't well defined. Unicode
prefers to use the technical term "grapheme", which (usually) encompasses
what "ordinary people consider a character in their native language".

If this strikes you as complicated, well, yes, it is complicated. The writing
systems of the world ARE complicated, and they clash.
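
Here is a concrete example of where str's code point semantics and grapheme
semantics part ways (plain Python 3; the regex module mentioned at the end is
the third-party one from PyPI, not the stdlib re):

    >>> import re, unicodedata
    >>> word = 'cafe\u0301'        # 'cafe' plus COMBINING ACUTE ACCENT
    >>> print(word)
    café
    >>> len(word)                  # five code points, but four graphemes
    5
    >>> unicodedata.name(word[-1]) # the "last character" is a bare accent
    'COMBINING ACUTE ACCENT'
    >>> len(re.findall('.', word)) # the stdlib re also counts code points
    5
    >>> len(unicodedata.normalize('NFC', word))       # the accent composes here...
    4
    >>> len(unicodedata.normalize('NFC', 'x\u0301'))  # ...but there is no precomposed x-acute
    2

The third-party regex module can at least iterate by grapheme cluster using \X:

    >>> import regex
    >>> len(regex.findall(r'\X', word))   # four clusters: c, a, f, e-plus-accent
    4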


> I would expect: 
> 
>    len(text)               to give me the length in characters
>    text[-1]                to evaluate to the last character
>    re.match("a.c", text)   to match a character between a and c

Until we have agreement on what is a character, we can't judge whether or not
this is meaningful.

For example:

- do 'ς' and 'σ' count as the same character?

- is '%' one symbol or three? How about '½'? If I write it as '1/2' does it 
  make a difference?

- are ligatures like 'æ' one or two letters?

- when should 'ß' uppercase to 'SS' and when to 'ẞ'? (Python's current answers
  to this and the next few questions are shown in the snippet after the list)

- do you lowercase 'SS' to 'ss' or 'ß'?

- do you uppercase 'i' to 'I' or 'İ'?

- can we distinguish between 'I' as in me and 'I' as in the Roman numeral 1?

- should the English letter 'a' with a hook on the top be treated as 
  different to the letter 'a' without the hook on top?

- should English italic letters get their own code point?

- how about Cyrillic italic letters?

- does it make a difference if we're using them as mathematical symbols?

- should we have a separate 'A' for English, French, German, Spanish,
  Portuguese, Norwegian, Dutch, Italian, etc?

- how about a separate '一' for Chinese, Japanese and Korean?
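
For what it's worth, Python 3's str methods already embody one set of answers
to some of those case questions (a quick look, not an endorsement; other
languages choose differently):

    >>> 'ß'.upper()                # one letter in, two letters out
    'SS'
    >>> 'ẞ'.lower()
    'ß'
    >>> 'SS'.lower()               # the ß is not recoverable
    'ss'
    >>> 'ß'.casefold()             # the caseless-comparison form
    'ss'
    >>> len('İ'.lower())           # 'i' plus COMBINING DOT ABOVE: two code points
    2
    >>> 'ς'.casefold() == 'σ'.casefold()   # final and medial sigma compare equal
    True

Whether those are the *right* answers is exactly what has to be decided, case
by case, before "real characters" mean anything.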

Those are only a *few* of the *easy* questions that need to be answered before
we can even consider your question:

> So the question is, should we have a third type for text. Or should the
> semantics of strings be changed to be based on characters?





[1] For now.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



