Grapheme clusters, a.k.a. real characters

Fri Jul 14 17:51:19 EDT 2017

Terry Reedy <tjreedy at udel.edu>:

> On 7/14/2017 10:30 AM, Michael Torrie wrote:
>> On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
>>> Of course, UTF-8 in a bytes object doesn't make the situation any
>>> better, but does it make it any worse?
>>
>>>
>>> As it stands, we have
>>>
>>>     è --[encode>-- Unicode --[reencode>-- UTF-8
>>>
>>> Why is one encoding format better than the other?
>
> All digital data are ultimately bits, usually collected together in
> groups of 8, called bytes.

Naturally.

> The point of Python 3 is that text should normally be instances of a
> text class, separate from the raw bytes class, with a defined internal
> encoding.

And I called its usefulness into question.
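
To make the objection concrete, here is a minimal Python 3 sketch (mine,
not from the earlier posts; the variable names are only illustrative).
It builds "è" as a base letter plus a combining accent and counts it
several ways. Even the text class counts code points, not what a reader
would call characters:

```python
import unicodedata

# "è" written as a base letter plus a combining grave accent (U+0300).
decomposed = "e\u0300"

# The same visible character as one precomposed code point (U+00E8).
composed = unicodedata.normalize("NFC", decomposed)

# One grapheme cluster on screen, but several different "lengths".
code_points_decomposed = len(decomposed)                  # 2 code points
code_points_composed = len(composed)                      # 1 code point
utf8_bytes_decomposed = len(decomposed.encode("utf-8"))   # 3 bytes
utf8_bytes_composed = len(composed.encode("utf-8"))       # 2 bytes

# Indexing the str yields the bare letter, not the accented character.
first = decomposed[0]                                     # "e"

print(code_points_decomposed, code_points_composed)
print(utf8_bytes_decomposed, utf8_bytes_composed)
print(repr(first))
```

NFC happens to merge this particular pair into one code point, but many
clusters have no precomposed form, so normalization is not a general fix.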

>> This is precisely the logic behind Google using UTF-8 for strings in Go,
>> rather than having some O(1) abstract type like Python has.  And many
>> other languages do the same.  The argument is that because of the very
>> issues that you mention, having O(1) lookup in a string isn't that
>> important, since looking up a particular index in a unicode string is
>> rarely the right thing to do, so UTF-8 is just fine as a native,
>> in-memory type.
>
> Does Go use bytes for text, like most people did in Python 2,

Yes. Also, C and the GNU textutils do that.

> or a separate text string class that hides the internal encoding format
> and implementation? In other words, if you do the equivalent of
> print(s) where s is a text string with a mixture of Greek, Cyrillic,
> Hindi, Chinese, Japanese, and Korean chars, do you see the characters,
> or some representation of the internal bytes?

Yes, in Python 2, Go, C and GNU textutils, when you print a text string
containing a mixture of languages, you see characters.

Why?

Because that's what the terminal emulator chooses to do upon receiving
those bytes.
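
A rough Python 3 sketch of the same point, assuming a terminal that
interprets its input as UTF-8 (the helper name emit is hypothetical).
After the one explicit encode, the program handles only bytes and never
decodes them; whatever glyphs appear are the terminal's doing:

```python
import sys

# A text string mixing several scripts (Greek, Cyrillic, Chinese).
text = "αβγ абв 中文"

# Encode once; from here on the program deals only in bytes.
raw = text.encode("utf-8")


def emit(data: bytes, stream=None) -> int:
    """Write raw bytes plus a newline to a binary stream, undecoded."""
    stream = sys.stdout.buffer if stream is None else stream
    written = stream.write(data + b"\n")
    stream.flush()
    return written


if __name__ == "__main__":
    # On a UTF-8 terminal this shows the original characters; on a
    # terminal set to another encoding the same bytes show as mojibake.
    emit(raw)
```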

Marko