Flexible string representation, unicode, typography, ...

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Aug 26 07:49:38 EDT 2012


On Sat, 25 Aug 2012 23:59:34 -0700, wxjmfauth wrote:

> On Sunday, 26 August 2012 00:26:56 UTC+2, Ian wrote:

>> More seriously, strings in Go are not sequences of runes.  They're
>> actually arrays of UTF-8 bytes.

Actually, it's worse than that. Strings in Go aren't even proper UTF-8. 
They are arbitrary bytes, which means you can create strings which are 
invalid Unicode.

Go looks like an interesting language, but it seems to me that they have 
totally screwed up strings. At least Python had the excuse that it is 20 
years old and carrying the old ASCII baggage. Nobody used Unicode in 1992 
when Python was invented. What is Google's excuse for getting Unicode 
wrong?

In Go, strings are UTF-8 encoded sequences of bytes, except when they're 
not, in which case they're arbitrary bytes. You can't tell if a string is 
valid UTF-8 unless you carefully inspect every single character and 
decide for yourself if it is valid. Don't know the rules for valid UTF-8? 
Too bad.
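For comparison, a minimal sketch of what "checking for valid UTF-8" looks 
like when the language does the work for you. This is Python 3, not Go, 
and the helper name is_valid_utf8 is invented for illustration; it relies 
only on the strict decoder that bytes.decode uses by default.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "euro sign": "€".encode("utf-8"),  # three bytes: e2 82 ac
    "truncated euro sign": b"\xe2\x82",  # final byte missing
    "lone continuation byte": b"\x80",
    "overlong encoding of '/'": b"\xc0\xaf",
}

for name, raw in samples.items():
    print(name, is_valid_utf8(raw))
```

Only the first sample is valid; the other three are byte sequences that a 
string of arbitrary bytes can hold but a Unicode string should never 
contain.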

This also means that basic string operations like slicing are both *slow* 
and *wrong* -- they are slow, because you have to track character 
boundaries yourself. And they are wrong, because most people won't 
bother, they'll just assume each character is one byte.
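To illustrate the "wrong" half of that, here is a small Python 3 sketch 
(an analogy, not Go code) of what happens when UTF-8 bytes are sliced as 
if each character were one byte. The variable names are made up for the 
example.

```python
text = "€100"
raw = text.encode("utf-8")  # 3 bytes for the euro sign + 3 ASCII bytes

# Slicing the text string counts characters, so this is the euro sign.
first_char = text[:1]

# Slicing the bytes under the one-byte-per-character assumption cuts the
# euro sign's three-byte sequence after its first byte.
broken = raw[:1]

try:
    broken.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

print(len(text), len(raw), first_char, broken, slice_is_valid)
```

The byte slice is no longer valid UTF-8, which is exactly the kind of 
string you end up with if you don't track character boundaries yourself.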

See here for more information:

http://comments.gmane.org/gmane.comp.lang.go.general/56245

Some useful quotes:

-  "Strings are *not* required to be UTF-8."

- "If the string must always be valid UTF-8 then relatively expensive
   validation is required for many operations. Plus making those
   operations able to fail complicates the interface."

- "In almost all cases strings are just byte arrays."

- "Go simply doesn't have 8-bit Unicode strings"

- "Python3 can afford the luxury of storing strings in UCS-2/UCS-4, 
  Go can't."

I don't question that Go needs a type for arbitrary bytes. But that 
should be "bytes", not "string", and it should be there for the advanced 
programmers who *need* to worry about bytes. Programmers who want to 
include strings in their applications (i.e. all of them) shouldn't need 
to care that "$" is one byte, "¢" is two, "€" is three, and "𤭢" 
(U+24B62) is four. With Python 3.3, it *just works*. With Go, it doesn't.
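A quick sketch of those four characters, assuming Python 3.3 or later 
(older narrow builds would report a length of 2 for the last one):

```python
# The four characters mentioned above; the last is written as an escape
# so the example does not depend on the source file's encoding.
chars = ["$", "¢", "€", "\U00024B62"]

# Each one is a single character as far as the string type is concerned.
lengths = [len(ch) for ch in chars]

# Only when encoding to UTF-8 do the differing byte widths appear.
utf8_widths = [len(ch.encode("utf-8")) for ch in chars]

for ch, n, w in zip(chars, lengths, utf8_widths):
    print("U+%04X" % ord(ch), "len:", n, "UTF-8 bytes:", w)
```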

In my not-so-humble opinion, Go has made a silly design error. Go 
programmers will be paying for this mistake for at least a decade. What 
they should have done is create two data types:

1) Strings which are guaranteed to be valid Unicode. That could be UTF-32 
or a PEP 393 approach, depending on how much memory you want to use, or 
even UTF-16 if you don't mind the complication of surrogate pairs.

2) Bytes which are not guaranteed to be valid Unicode but let the 
programmer work with arbitrary bytes.
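The trade-off between the encodings named in (1) can be sketched in 
Python 3 using the standard codecs. This only shows the encoded sizes; it 
says nothing about how any particular implementation stores strings 
internally.

```python
import struct

ascii_char = "$"
astral_char = "\U00024B62"  # outside the Basic Multilingual Plane

# UTF-32: every code point takes four bytes, whatever it is.
utf32_sizes = (len(ascii_char.encode("utf-32-le")),
               len(astral_char.encode("utf-32-le")))

# UTF-16: two bytes for BMP characters, four (a surrogate pair) otherwise.
utf16_sizes = (len(ascii_char.encode("utf-16-le")),
               len(astral_char.encode("utf-16-le")))

# Unpack the surrogate pair as two little-endian 16-bit code units.
high, low = struct.unpack("<2H", astral_char.encode("utf-16-le"))

print(utf32_sizes, utf16_sizes, hex(high), hex(low))
```

The two 16-bit units fall in the high (0xD800-0xDBFF) and low 
(0xDC00-0xDFFF) surrogate ranges, which is the complication UTF-16 brings 
with it.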

(If this sounds familiar, it should -- it is exactly what Python 3 does. 
We have a string type that guarantees to be valid Unicode, and a bytes 
type that doesn't.)
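A minimal sketch of that separation in Python 3. The variable names are 
illustrative only.

```python
text = "€"                  # str: a sequence of Unicode code points
raw = text.encode("utf-8")  # bytes: the same character as three bytes

# Converting between the two types is always explicit.
round_trip = raw.decode("utf-8")

# The two types do not mix silently; concatenating them is an error.
try:
    text + raw
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(type(text).__name__, type(raw).__name__, round_trip == text,
      mixing_allowed)
```

Code that only wants text never has to touch the bytes type, and code 
that needs arbitrary bytes has to say so.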

As given, *every single programmer* who wants to use Unicode in Go is now 
responsible for doing all the hard work of validating UTF-8, converting 
from bytes to strings, etc. Sure, eventually Go will have libraries to do 
that, but not yet, and even when it does, many people will not use them 
and their code will fail to handle Unicode correctly.

Right now, every Go programmer who wants Unicode has to pay the cost of 
the freedom to have arbitrary byte sequences, whether they need those 
arbitrary bytes or not. The consequence is that instead of Go making 
Unicode as trivial and easy to use as it should be, it will be hard to 
get right, annoying, slow and painful. Another generation of programmers 
will grow up thinking that Unicode is all too difficult and we should 
stick to just plain ASCII.

Since Go doesn't have Unicode strings, you can never trust that a string 
is valid UTF-8, you can't slice it efficiently, you can't get the length 
in characters, and you can't write it to a file and expect other 
applications to be able to read it. Sure, sometimes it will work, and 
then somebody will input a Euro sign into your application, and it will 
blow up.

Why am I not surprised that JMF misunderstands both Go byte-strings and 
Python Unicode strings?


> Sorry, you do not get it.
> 
> The rune is an alias for int32. A sequence of runes is a sequence of
> int32's.

It certainly is not. Runes are variable-width. Here, for example, are a 
number of Go functions which return a single rune and its width in bytes:

http://golang.org/pkg/unicode/utf8/
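As a rough analogy only, here is a Python 3 sketch of a function that 
returns the first code point in a UTF-8 byte string together with the 
number of bytes its encoding occupies. The name decode_first is invented 
for this example and is not the Go API; it does not attempt to mirror how 
the Go package reports errors.

```python
def decode_first(data: bytes):
    """Return (code_point, width_in_bytes) for the first UTF-8 sequence.

    Hypothetical helper. Raises ValueError (UnicodeDecodeError is a
    subclass of it) if the data is empty or does not start with a
    well-formed sequence.
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        width = 1          # 0xxxxxxx: ASCII
    elif lead >> 5 == 0b110:
        width = 2          # 110xxxxx
    elif lead >> 4 == 0b1110:
        width = 3          # 1110xxxx
    elif lead >> 3 == 0b11110:
        width = 4          # 11110xxx
    else:
        raise ValueError("invalid lead byte 0x%02x" % lead)
    # The strict decoder rejects truncated, overlong and surrogate forms.
    char = data[:width].decode("utf-8")
    return ord(char), width


for sample in ["$", "¢", "€", "\U00024B62"]:
    print(decode_first(sample.encode("utf-8")))
```

The code point comes back as a plain integer, while the width reported 
alongside it varies from one to four bytes.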


> Go does not spend its time in using a machinery to work with, to
> differentiate, to keep in memory this sequence according to the
> *characters* composing this "array of code points".
> 
> The message is even stronger. Use runes to work comfortably [*] with
> unicode:
> rune -> int32 -> utf32 -> unicode (the perfect scheme, can't be better)

Runes are not int32, and int32 is not UTF-32.

Whether UTF-32 is the "perfect scheme" for Unicode is a matter of opinion.



-- 
Steven


