Grapheme clusters, a.k.a. real characters

Marko Rauhamaa marko at pacujo.net
Fri Jul 14 15:09:18 EDT 2017


Michael Torrie <torriem at gmail.com>:

> On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
>> 
>> As it stands, we have
>> 
>>    è --[encode>-- Unicode --[reencode>-- UTF-8
>> 
>> Why is one encoding format better than the other?
>
> This is precisely the logic behind Google using UTF-8 for strings in
> Go, rather than having some O(1) abstract type like Python has. And
> many other languages do the same. The argument is that because of the
> very issues that you mention, having O(1) lookup in a string isn't
> that important, since looking up a particular index in a unicode
> string is rarely the right thing to do, so UTF-8 is just fine as a
> native, in-memory type.
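
To illustrate the point about indexing, here is a minimal sketch using only
the standard library. The variable names are made up for the example, and it
assumes the "è" above may arrive in decomposed form (a base letter followed by
a combining accent). Neither the code point index nor the UTF-8 byte index
identifies the grapheme cluster a reader would call one character:

```python
import unicodedata

# One user-perceived character, written as two code points:
# LATIN SMALL LETTER E followed by COMBINING GRAVE ACCENT.
decomposed = "e\u0300"

# The same character as the single precomposed code point U+00E8.
composed = unicodedata.normalize("NFC", decomposed)

# Length in code points differs between the two forms.
print(len(decomposed))                  # 2
print(len(composed))                    # 1

# Length in UTF-8 bytes differs as well.
print(len(decomposed.encode("utf-8")))  # 3
print(len(composed.encode("utf-8")))    # 2

# O(1) indexing into the decomposed string returns the bare base
# letter, not the accented character.
print(decomposed[0] == "e")             # True
```

Segmenting text into grapheme clusters takes extra machinery either way, so
this is only meant to show that a code point index is not a character index.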

It pays to come in late.

Windows NT and Java evaded the 8-bit localization nightmare by going
UCS-2.

Python3 managed not to repeat the earlier UCS-2 blunders by going all
the way to UCS-4.

Go saw the futility of UCS-4 as a separate data type and dropped down to
UTF-8.

Unfortunately, Guile is following in Python3's footsteps.


Marko


