Grapheme clusters, a.k.a. real characters

Steve D'Aprano steve+python at pearwood.info
Fri Jul 14 21:20:33 EDT 2017


On Sat, 15 Jul 2017 07:12 am, Terry Reedy wrote:

> Does go use bytes for text, like most people did in Python 2, or a separate
> text string class, that hides the internal encoding format and
> implementation?  In other words, if you do the equivalent of print(s)
> where s is a text string with a mixture of greek, cyrillic, hindi,
> chinese, japanese, and korean chars, do you see the characters, or some
> representation of the internal bytes?

The answer is, it's complicated.

Go has two types for handling text: "strings", and "runes".

Strings are equivalent to Python 3 byte-strings, except that the language is
biased towards assuming they are UTF-8 instead of Python 3's decision to assume
they are ASCII. In other words, if you display a Python 3 byte-string, it will
display bytes that represent ASCII characters as ASCII, and everything else
escaped as a hex byte:

py> b'\x41\xcf\x80\x5a'
b'A\xcf\x80Z'

Go does the same, except it will display anything which is legal UTF-8 (which
may be 1, 2, 3, or 4 bytes) as a Unicode character (actually code point),
assuming your environment is capable of displaying that character. Otherwise
you'll just see a square, or some other artifact.

So if Python used the same rules as Go, the above byte-string would display as:

b'AπZ'
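
For what it's worth, Python can approximate that Go-style display by decoding
the bytes as UTF-8 and escaping only the bytes that fail to decode. This is just
a sketch of the idea, not a description of how Go implements it, and
go_style_display is a name I made up for the example:

```python
def go_style_display(data: bytes) -> str:
    """Show valid UTF-8 sequences as characters; escape any other byte as \\xNN."""
    # errors="backslashreplace" is accepted when decoding since Python 3.5.
    return data.decode("utf-8", errors="backslashreplace")

print(go_style_display(b'\x41\xcf\x80\x5a'))  # valid UTF-8 throughout: AπZ
print(go_style_display(b'\x41\xff\x5a'))      # 0xff is never valid UTF-8: A\xffZ
```

The second call shows the fallback: a byte that cannot be part of UTF-8 is
still displayed, just in escaped form.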

Most of the time, when processing strings, Go treats them as arbitrary bytes,
although Go comes with libraries that help make it easier to work with them as
UTF-8 byte strings.
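
In Python terms, the "arbitrary bytes" view means that length and indexing count
bytes, not characters. As far as I understand it, looping over a Go string with
range is one of the places where Go decodes UTF-8 for you, handing back a byte
offset and a code point each time. Here is a rough Python imitation, assuming
the input is valid UTF-8 (code_points_with_offsets is a made-up name, and
invalid input simply raises here):

```python
def code_points_with_offsets(data: bytes):
    """Yield (byte_offset, character) pairs for UTF-8 encoded data.

    Assumes data is valid UTF-8; invalid input raises UnicodeDecodeError.
    """
    offset = 0
    for ch in data.decode("utf-8"):
        yield offset, ch
        offset += len(ch.encode("utf-8"))  # width of this code point in bytes

s = b'A\xcf\x80Z'
print(len(s))   # 4: counts bytes, not characters
print(s[1])     # 207: indexing gives a byte value (0xcf), not a character
print(list(code_points_with_offsets(s)))  # [(0, 'A'), (1, 'π'), (3, 'Z')]
```

Note how the offsets jump from 1 to 3, because π takes two bytes in UTF-8.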

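Runes, on the other hand, are a strict superset of Unicode code points. A rune
is a single 32-bit code unit (the type is an alias for int32), so a sequence of
runes is like UTF-32 except not limited to the Unicode range of \U00000000
through \U0010FFFF. A rune will accept any value that fits in a signed 32-bit
integer, up to 0x7FFFFFFF, whether or not it is a valid code point.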
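
To see the difference in range from the Python side: chr() refuses anything
above \U0010FFFF, while a 32-bit unit has room for far more. The sketch below
only probes Python's limit; it assumes the 32-bit unit is signed, and
python_accepts is a name invented for the example:

```python
MAX_CODE_POINT = 0x10FFFF  # top of the Unicode range
MAX_INT32 = 0x7FFFFFFF     # largest value of a signed 32-bit integer (assumed)

def python_accepts(value: int) -> bool:
    """Return True if chr() can turn value into a one-character str."""
    try:
        chr(value)
    except (ValueError, OverflowError):
        return False
    return True

for value in (0x3C0, MAX_CODE_POINT, MAX_CODE_POINT + 1, MAX_INT32):
    print(hex(value), python_accepts(value))
```

The first two values come back True and the last two False, which is the sense
in which a 32-bit unit can hold things a Python text string cannot.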

I presume that runes which fall within the Unicode range will be displayed as
the Unicode character where possible, and those which fall outside of that range
as some sort of hex display.

So Go strings are like Python byte strings, biased towards UTF-8 but with no
guarantees made, and sequences of Go runes are a superset of Python text
strings.

Does that answer your question sufficiently?

https://blog.golang.org/strings


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



