Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Sat Jul 15 03:50:54 EDT 2017


Steve D'Aprano <steve+python at pearwood.info>:

> On Sat, 15 Jul 2017 04:10 am, Marko Rauhamaa wrote:
>> Python3's strings don't give me any better random access than UTF-8.
>
> Say what? Of course they do.
>
> Python 3 strings (since 3.3) are a compact form of UTF-32. Without loss of
> generality, we can say that each string is an array of four-byte code units.

Yes, and a UTF-8 byte array gives me random access to the UTF-8
single-byte code units.

Neither gives me random access to the "Grapheme clusters, a.k.a. real
characters". For example, the HFS+ file system uses a variant of NFD
for filenames, meaning both UTF-32 and UTF-8 give you random access to
pure ASCII filenames only.
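
A minimal sketch of what decomposition does to indexing. It uses
standard NFD from the unicodedata module as a stand-in for the HFS+
variant (an assumption, the two are not identical), and the filename
is a made-up example:

```python
import unicodedata

# Hypothetical filename; U+00E9 is the precomposed "e with acute".
name = "caf\u00e9.txt"

# Standard NFD, used here only as a stand-in for the HFS+ variant.
nfd = unicodedata.normalize("NFD", name)

print(len(name), len(nfd))            # 8 9 (code points)
print(nfd[3] == "e")                  # True: index 3 is now a bare "e"
print(nfd[4] == "\u0301")             # True: the accent is a separate code point
print(len(nfd.encode("utf-8")))       # 10 (UTF-8 code units)
```

Indexing the decomposed name by code point or by byte lands on a
fragment of the accented letter either way.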

> UTF-8 is not: it is a variable-width encoding,

UTF-32 is a variable-width encoding as well once you count what the
user sees as characters. For example, the single emoji "baby: medium
skin tone" is two code points, U+1F476 U+1F3FD:

  <URL: http://unicode.org/emoji/charts/full-emoji-list.html#1f476_1f3fd>
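
A small sketch of the same emoji as a Python 3 string, using only the
standard library (the variable names are mine):

```python
import unicodedata

# "baby: medium skin tone" as two code points
baby = "\U0001F476\U0001F3FD"

print(len(baby))                         # 2 code points for one visible glyph
print(len(baby.encode("utf-32-le")))     # 8 bytes, i.e. two UTF-32 code units
print(unicodedata.name(baby[0]))         # BABY
print(unicodedata.name(baby[1]))         # EMOJI MODIFIER FITZPATRICK TYPE-4
```

baby[0] is the unmodified baby and baby[1] is a bare skin-tone
modifier; neither index gives you the emoji the user sees.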

> Go ignores this problem by simply not offering random access to code
> points in strings.

Random access to code points is as uninteresting as random access to
UTF-8 bytes.

I might want random access to the "Grapheme clusters, a.k.a. real
characters". As you have pointed out, that wish is impossible to grant
unambiguously.
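
To illustrate, here is a deliberately crude sketch that attaches
combining marks to the preceding base code point. It is not the
Unicode segmentation algorithm (UAX #29), and naive_clusters is a
hypothetical helper of mine, not a standard library function:

```python
import unicodedata

def naive_clusters(text):
    """Group each code point with the combining marks that follow it.

    Crude approximation only: it looks at the canonical combining
    class and nothing else.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # nonzero combining class: attach to previous
        else:
            clusters.append(ch)     # otherwise start a new cluster
    return clusters

decomposed = unicodedata.normalize("NFD", "caf\u00e9")
print(len(decomposed), len(naive_clusters(decomposed)))   # 5 4

baby = "\U0001F476\U0001F3FD"
print(len(naive_clusters(baby)))                          # 2
```

The rule recovers the four letters of the decomposed word but still
splits the emoji in two, because the skin-tone modifier has combining
class 0. Any indexing by cluster also needs a linear scan first.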


Marko


