Grapheme clusters, a.k.a. real characters

Michael Torrie torriem at gmail.com
Fri Jul 14 10:30:24 EDT 2017


On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
> Of course, UTF-8 in a bytes object doesn't make the situation any
> better, but does it make it any worse?
>
> As it stands, we have
> 
>    è --[encode>-- Unicode --[reencode>-- UTF-8
> 
> Why is one encoding format better than the other?

This is precisely the logic behind Google's choice of UTF-8 for strings
in Go, rather than an abstract string type with O(1) indexing by code
point like Python's.  Many other languages do the same.  The argument is
that, because of the very issues you mention, O(1) lookup in a string
isn't that important: looking up a particular index in a Unicode string
is rarely the right thing to do, so UTF-8 is just fine as a native,
in-memory type.
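
To make that concrete, here is a rough sketch in Python using only the
standard unicodedata module.  The helper rough_clusters() is a made-up
name for illustration, and it only glues combining marks onto the
preceding code point; it is not the full grapheme cluster algorithm from
UAX #29, so treat it as an approximation.

```python
import unicodedata

# "è" can be one precomposed code point, or "e" plus a combining accent.
composed = "\u00e8"
decomposed = unicodedata.normalize("NFD", composed)  # "e" + U+0300

# Same user-perceived character, different lengths at every level.
code_points = (len(composed), len(decomposed))
utf8_bytes = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Indexing by code point is O(1) but splits the character: this is a bare "e".
first = decomposed[0]


def rough_clusters(text):
    """Group each code point with the combining marks that follow it.

    Hypothetical helper: a rough approximation, not UAX #29.
    """
    clusters = []
    for ch in text:
        # combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

Either way you have to scan the string to find the "real character"
boundaries, whether the storage underneath is an array of code points or
UTF-8 bytes.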



