Grapheme clusters, a.k.a. real characters

Terry Reedy tjreedy at udel.edu
Fri Jul 14 17:12:10 EDT 2017


On 7/14/2017 10:30 AM, Michael Torrie wrote:
> On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
> 
>>
>> As it stands, we have
>>
>>     è --[encode>-- Unicode --[reencode>-- UTF-8
>>
>> Why is one encoding format better than the other?

All digital data are ultimately bits, usually collected together in
groups of 8, called bytes.  The point of Python 3 is that text should
normally be instances of a text class, separate from the raw bytes
class, with a defined internal encoding.  The actual internal encoding
is secondary, and it changed in 3.3.
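
A minimal sketch of that separation, using the è from the quoted diagram
(the variable names are only for illustration):

```python
text = "\u00e8"                 # 'è' as a single code point, U+00E8
data = text.encode("utf-8")     # explicit conversion to raw bytes

print(type(text).__name__, len(text))   # str 1
print(type(data).__name__, len(data))   # bytes 2
print(list(data))                       # [195, 168]

# The text class and the bytes class do not mix implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)                         # False

# Decoding with the same codec recovers the original text.
roundtrip = data.decode("utf-8")
print(roundtrip == text)                # True
```

Nothing here depends on how the interpreter stores the str internally;
the UTF-8 bytes exist only because encode() was asked for them.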

Python ints are encoded as bytes, and so are floats and everything else.
When one prints a float, one certainly does not see a representation of
the raw bytes in the float object.  Instead, one sees a representation
of its value.  There is a proposal to change the internal encoding of
int, at least on 64-bit machines, which are now standard.  However,
because print(87987282738472387429748) prints 87987282738472387429748
and not the internal bytes, a change in the internal bytes will not
affect the user's view of ints.
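
A rough sketch of the difference between the printed value and one
possible byte encoding.  Note the assumptions: struct.pack and
int.to_bytes produce portable serializations chosen here for
illustration; they are not necessarily the bytes the interpreter keeps
inside the object.

```python
import struct

x = 0.1
raw = struct.pack("<d", x)      # 8 bytes, IEEE 754 double, little-endian
print(x)                        # 0.1
print(raw.hex())                # 9a9999999999b93f

n = 87987282738472387429748
n_bytes = n.to_bytes((n.bit_length() + 7) // 8, "big")
print(n)                        # 87987282738472387429748
print(n_bytes.hex())            # one byte-level view of the same value

# Both byte strings decode back to the values the user sees.
x_back = struct.unpack("<d", raw)[0]
n_back = int.from_bytes(n_bytes, "big")
```

Whatever layout is used underneath, print() goes through the value, so
the output of print(n) stays the same.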

> This is precisely the logic behind Google using UTF-8 for strings in Go,
> rather than having some O(1) abstract type like Python has.  And many
> other languages do the same.  The argument is that because of the very
> issues that you mention, having O(1) lookup in a string isn't that
> important, since looking up a particular index in a unicode string is
> rarely the right thing to do, so UTF-8 is just fine as a native,
> in-memory type.

Does Go use bytes for text, like most people did in Python 2, or a
separate text string class that hides the internal encoding format and
implementation?  In other words, if you do the equivalent of print(s)
where s is a text string with a mixture of Greek, Cyrillic, Hindi,
Chinese, Japanese, and Korean chars, do you see the characters, or some
representation of the internal bytes?
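
For comparison, this is the distinction in Python 3 terms.  The sample
string is made up for illustration, and whether the characters actually
render depends on the terminal and its fonts:

```python
# Greek, Cyrillic, Hindi, Chinese, Japanese, and Korean in one text string.
s = "αβγ абв हिन्दी 中文 日本語 한국어"

print(s)             # shows the characters themselves

encoded = s.encode("utf-8")
print(encoded)       # shows a bytes repr, starting b'\xce\xb1\xce\xb2...
```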


-- 
Terry Jan Reedy




