[I18n-sig] Re: [Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Tue, 02 May 2000 10:53:19 -0400


> [GvR]
> >The big deal is that in some cultures, 8-bit strings with non-ASCII
> >bytes are unlikely to be Latin-1.  Under the Latin-1 convention, they
> >would get garbage when mixing Unicode and regular strings.

[Just]
> They would also get garbage under the utf-8 convention, so again, a
> moot point.

No, because I changed my position!  I now propose to make ASCII the
default conversion (i.e., characters must be in range(128) to avoid an
exception).  You are arguing for Latin-1 which gives them silent
errors.  I *was* arguing for UTF-8, which would give them likely but
not guaranteed errors.  I *am* now arguing for ASCII, which guarantees
them errors (if they are in fact using an encoding).
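
To make the three options concrete, here is a rough sketch, spelled with
explicit decode calls on a made-up byte string (the sample data is just for
illustration):

    data = b"caf\xe9"   # an 8-bit string containing one byte >= 128

    for codec in ("ascii", "latin-1", "utf-8"):
        try:
            print(codec, "->", repr(data.decode(codec)))
        except UnicodeDecodeError as exc:
            print(codec, "-> error:", exc)

    # ascii   -> error: any byte >= 128 is rejected (guaranteed error)
    # latin-1 -> 'café': never an error, silently possibly wrong
    # utf-8   -> error here, but only because \xe9 is not valid UTF-8 in
    #            this position; other byte sequences would slip through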

> >This is
> >more like ignoring overflow on integer addition (so that 2000000000*2
> >yields -294967296).  I am against silently allowing erroneous results
> >like this if I can help it.
> 
> As I've explained before, such encoding issues are silent by nature.
> There's *nothing* you can ever do about it. The silent errors caused by
> defaulting utf-8 are far worse.

Which is why I no longer argue for it.

> Hm, Moshe wrote:
> """I much prefer the Fredrik-Paul position, known also as the
> character is a character position, to the UTF-8 as default encoding.
> Unicode is western-centered -- the first 256 characters are Latin 1.
> """

And then proceeded to write: "If I'm reading Hebrew from an ISO-8859-8
file, I'll set a conversion to Unicode on the fly anyway [...]".
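
Spelled out, that on-the-fly conversion might look something like this (the
filename is made up; codecs.open is just one way to write it):

    import codecs

    # Explicitly decode an ISO-8859-8 (Hebrew) file to Unicode while reading,
    # rather than relying on any default conversion.
    with codecs.open("hebrew.txt", encoding="iso-8859-8") as f:
        text = f.read()   # text is a Unicode string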

> And Ping wrote:
> """If it turns out automatic conversions *are* absolutely necessary,
> then i vote in favour of the simple, direct method promoted by Paul
> and Fredrik: just copy the numerical values of the bytes.  The fact
> that this happens to correspond to Latin-1 is not really the point;
> the main reason is that it satisfies the Principle of Least Surprise.
> """
> 
> I thought that was pretty clear.

But he first proposed to have no conversions at all.  I am now
convinced that UTF-8 is bad, and that having no default conversion at
all is bad.  We need at least ASCII.  I claim that we need no more
than ASCII.  The reason is that Latin-1 is not a safe assumption;
ASCII is.  (Unless it's not characters at all -- but usually binary
goop contains more than a smattering of bytes in range(128, 256) so it
would typically be caught right away.)
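
A quick illustration of that last point, with os.urandom standing in for
arbitrary binary data:

    import os

    goop = os.urandom(64)      # stand-in for arbitrary binary goop
    try:
        goop.decode("ascii")   # proposed default: bytes >= 128 are rejected
    except UnicodeDecodeError:
        print("caught right away")
    else:
        print("every byte happened to be < 128")   # vanishingly unlikely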

> >Having no default encoding would be like having no automatic coercion
> >between ints and long ints -- I tried this in very early Python
> >versions (around 0.9.1 I believe) but Tim Peters and/or Steve Majewski
> >quickly dissuaded me of this bad idea.
> 
> 1. Currently utf-8 is the default. Many of us are trying to dissuade you of
> this bad idea.

I agree.

> 2. You propose to *not* provide a default encoding for characters >= 128

Correct.

> 3. Many of us are trying to dissuade you of this bad idea.

So far you're the only one -- I haven't seen other responses to this
idea yet.

> (Too bad none of us is called Tim or Steve, or you would've been convinced
> a long time ago ;-)
> 
> One additional fact: 8-bit encodings exist that are not even compatible
> with 7-bit ascii, making the choice to only compare if it's 7-bit ascii
> look even more arbitrary.

But there's a compelling argument that *requires* ASCII (see previous
post), and encodings that are not a superset of ASCII are rare.
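
(EBCDIC is the classic example of such a non-superset; a tiny illustration:)

    # 'A' is 0x41 in ASCII and in every ASCII superset, but 0xC1 in EBCDIC.
    "A".encode("latin-1")   # -> b'\x41'
    "A".encode("cp500")     # -> b'\xc1' (cp500 is one of the EBCDIC codecs)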

> Guido, maybe you'll believe it from your loving little brother: Guidos are
> not always right ;-)

But they listen to reason.  I've been convinced that UTF-8 is bad.
I'm not convinced that Latin-1 is good, and I'm proposing what I think
is a very Pythonic compromise: ASCII, on which we (nearly) all can
agree.

> >[Just]
> >> You're going to have a hard time explaining that "\377" != u"\377".
> >
> [GvR]
> >I agree.  You are an example of how hard it is to explain: you still
> >don't understand that for a person using CJK encodings this is in fact
> >the truth.
> 
> That depends on the definition of truth: if you document that 8-bit strings
> are Latin-1, the above is the truth. Conceptually classifying all other 8-bit
> encodings as binary goop makes the semantics crystal clear.
[and later]
> Oops, I meant of course that "\377" == u"\377" is then the truth...

I can document that 1==2 but that doesn't make it true.  Since we
can have binary goop in 8-bit strings, 8-bit strings are NOT always
Latin-1.  At least until Python 3000.
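
To put it concretely (the byte values below are just an example): the same
8-bit string stands for different characters depending on which encoding you
assume, so there is no single "true" Unicode value for it.

    raw = b"\xb0\xa1"        # some 8-bit data of unknown origin
    raw.decode("latin-1")    # -> '°¡'  (two Latin-1 characters)
    raw.decode("euc-kr")     # -> '가'  (one Hangul syllable, U+AC00)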

Think about it once more.  Why do you really want Latin-1?

--Guido van Rossum (home page: http://www.python.org/~guido/)