[I18n-sig] Re: Unicode debate

Guido van Rossum guido@python.org
Tue, 02 May 2000 08:31:55 -0400


>     No automatic conversions between 8-bit "strings" and Unicode strings.
> 
> If you want to turn UTF-8 into a Unicode string, say so.
> If you want to turn Latin-1 into a Unicode string, say so.
> If you want to turn ISO-2022-JP into a Unicode string, say so.
> Adding a Unicode string and an 8-bit "string" gives an exception.

I'd accept this, with one change: mixing Unicode and 8-bit strings is
okay when the 8-bit strings contain only ASCII (byte values 0 through
127).  That does the right thing when the program is combining
ASCII data (e.g. literals or data files) with Unicode and warns you
when you are using characters for which the encoding matters.  I
believe that this is important because much existing code dealing with
strings can in fact deal with Unicode just fine under these
assumptions.  (E.g. I needed only 4 changes to htmllib/sgmllib to make
them deal with Unicode strings -- those changes were all getattr() and
setattr() calls.)
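
A minimal sketch of the ASCII-only mixing rule, assuming modern Python 3
where bytes stands in for the 8-bit string and str for the Unicode
string.  The helper name concat_mixed is hypothetical and only
illustrates the proposed behavior; it is not an existing API.

```python
def concat_mixed(text: str, raw: bytes) -> str:
    """Join a Unicode string with an 8-bit string that must be pure ASCII."""
    # bytes.decode("ascii") raises UnicodeDecodeError for any byte >= 128,
    # which is the point where the real encoding would start to matter.
    return text + raw.decode("ascii")


# ASCII data (e.g. a literal) combines without any explicit conversion.
print(concat_mixed("caf\u00e9 ", b"menu"))

# A non-ASCII byte is refused instead of being guessed at.
try:
    concat_mixed("caf\u00e9 ", b"\xe9")
except UnicodeDecodeError as exc:
    print("refused:", exc.reason)
```

Letting the decode error propagate keeps the failure at the place where
the two kinds of string were mixed.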

When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
bytes in either should make the comparison fail; when ordering is
important, we can make an arbitrary choice e.g. "\377" < u"\200".
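
The comparison rule could be sketched as follows, again assuming Python 3
with bytes as the 8-bit string.  This reads "fail" as "compare unequal",
which is an assumption; raising an exception would be another reading.
The helper name mixed_equal is hypothetical, and ordering is left out
because the text only calls it an arbitrary choice.

```python
def mixed_equal(text: str, raw: bytes) -> bool:
    """Equality between a Unicode string and an 8-bit string."""
    # Non-ASCII content on either side: report "not equal" rather than
    # guess which encoding the bytes are in.
    if not text.isascii() or not raw.isascii():
        return False
    # Both sides are pure ASCII, so decoding is unambiguous.
    return text == raw.decode("ascii")
```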

Why not Latin-1?  Because it gives us Western-alphabet users a false
sense that our code works, where in fact it is broken as soon as you
change the encoding.
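
A small illustration of that false sense of safety, assuming Python 3
and a byte string that actually holds UTF-8.  Decoding as Latin-1 never
raises, because every byte value maps to some Latin-1 character, so the
mistake goes unnoticed.

```python
raw = "caf\u00e9".encode("utf-8")    # the bytes are really UTF-8

# Latin-1 accepts every byte value, so this silently yields wrong text.
as_latin1 = raw.decode("latin-1")

# Stating the real encoding recovers the intended text.
as_utf8 = raw.decode("utf-8")

# An ASCII-only default complains instead of producing garbage.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(as_latin1), repr(as_utf8), ascii_ok)
```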

> P. S.  The scare-quotes when i talk about 8-bit "strings" expose my
> sense of them as byte-buffers -- since that *is* all you get when you
> read in some bytes from a file.  If you manipulate an 8-bit "string"
> as a character string, you are implicitly making the assumption that
> the byte values correspond to the character encoding of the character
> repertoire you want to work with, and that's your responsibility.

This is how I think of them too.

> P. P. S.  If always having to specify encodings is really too much,
> i'd probably be willing to consider a default-encoding state on the
> Unicode class, but it would have to be a stack of values, not a
> single value.

Please elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)