Grapheme clusters, a.k.a. real characters

Terry Reedy tjreedy at udel.edu
Sat Jul 15 02:27:22 EDT 2017


On 7/14/2017 9:20 PM, Steve D'Aprano wrote:
> On Sat, 15 Jul 2017 07:12 am, Terry Reedy wrote:
> 
>> Does Go use bytes for text, like most people did in Python 2, or a separate
>> text string class that hides the internal encoding format and
>> implementation?  In other words, if you do the equivalent of print(s)
>> where s is a text string with a mixture of Greek, Cyrillic, Hindi,
>> Chinese, Japanese, and Korean chars, do you see the characters, or some
>> representation of the internal bytes?
> 
> The answer is, it's complicated.
> 
> Go has two text-related types: "strings" and "runes".
> 
> Strings are equivalent to Python 3 byte-strings, except that the language is
> biased towards assuming they are UTF-8 instead of Python 3's decision to assume
> they are ASCII. In other words, if you display a Python 3 byte-string, it will
> display bytes that represent ASCII characters as ASCII, and everything else
> escaped as a hex byte:
> 
> py> b'\x41\xcf\x80\x5a'
> b'A\xcf\x80Z'
> 
> Go does the same, except it will display anything which is legal UTF-8 (which
> may be 1, 2, 3, or 4 bytes) as a Unicode character (actually code point),
> assuming your environment is capable of displaying that character; otherwise
> you'll just see a square or some other artifact.
> 
> So if Python used the same rules as Go, the above byte-string would display as:
> 
> b'AπZ'
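The display rule described above can be approximated from Python itself.
This is only a sketch: display_utf8_biased and code_points are hypothetical
helper names, and the decoding rule is an assumption standing in for the
behaviour described, not Go's actual formatting code.

```python
def display_utf8_biased(data: bytes) -> str:
    """Hypothetical helper: show valid UTF-8 as text, escape everything else."""
    return data.decode("utf-8", errors="backslashreplace")


def code_points(text: str) -> list:
    """List the code points of a str in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    data = b"\x41\xcf\x80\x5a"
    # Python's own repr escapes the two non-ASCII bytes.
    print(repr(data))
    # Decoded as UTF-8, bytes CF 80 become one code point, U+03C0 (pi).
    print(code_points(display_utf8_biased(data)))
    # A byte that is not valid UTF-8 stays visible as an escape.
    print(display_utf8_biased(b"A\xffZ"))
```

The four bytes decode to three code points, which is why the UTF-8-biased
display is shorter than the escaped one.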
> 
> Most of the time, when processing strings, Go treats them as arbitrary bytes,
> although Go comes with libraries that help make it easier to work with them as
> UTF-8 byte strings.
> 
> Runes, on the other hand, are a strict superset of Unicode code points. A rune
> is a 32-bit integer (an alias for int32), so a sequence of runes is like UTF-32
> except not limited to the Unicode range of \U00000000 through \U0010FFFF: a
> rune will accept any 32-bit value.
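Only the Python half of this comparison can be checked from a Python prompt.
A minimal sketch, assuming CPython 3: fits_in_python_str is a hypothetical
helper name, and the code says nothing about what Go itself accepts.

```python
import sys

# Python text strings stop at the top of the Unicode range.
MAX_CODE_POINT = 0x10FFFF
assert sys.maxunicode == MAX_CODE_POINT


def fits_in_python_str(value: int) -> bool:
    """Return True if chr() accepts the value as a code point."""
    try:
        chr(value)
    except (ValueError, OverflowError):
        # Out-of-range values are rejected rather than stored.
        return False
    return True


if __name__ == "__main__":
    for value in (0x3C0, 0x10FFFF, 0x110000):
        print(hex(value), fits_in_python_str(value))
```

So a 32-bit value above \U0010FFFF has no counterpart in a Python text string,
which is the sense in which a rune sequence can hold more than a str can.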
> 
> I presume that runes which fall within the UTF-32 range will be displayed as the
> Unicode character where possible, and those which fall outside of that range as
> some sort of hex display.
> 
> So Go strings are like Python byte strings, biased towards UTF-8 but with no
> guarantees made, and Go runes are a superset of Python text strings.
> 
> Does that answer your question sufficiently?
> 
> https://blog.golang.org/strings

Yes, thank you.


-- 
Terry Jan Reedy




