[I18n-sig] Re: [Python-Dev] Unicode debate

Just van Rossum just@letterror.com
Tue, 2 May 2000 16:38:51 +0100


[GvR]
>Why not Latin-1?  Because it gives us Western-alphabet users a false
>sense that our code works, where in fact it is broken as soon as you
>change the encoding.

[Just]
> Yeah, and? At least it'll *show* it's broken instead of *silently* doing
> the wrong thing with utf-8.
>
> It's like using Python ints all over the place, and suddenly a user of the
> application enters data that causes an integer overflow. Boom. Program
> needs to be fixed. What's the big deal?

[GvR]
>The big deal is that in some cultures, 8-bit strings with non-ASCII
>bytes are unlikely to be Latin-1.  Under the Latin-1 convention, they
>would get garbage when mixing Unicode and regular strings.

They would also get garbage under the utf-8 convention, so again, a moot point.

>This is
>more like ignoring overflow on integer addition (so that 2000000000*2
>yields -294967296).  I am against silently allowing erroneous results
>like this if I can help it.
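
For reference, the wraparound being described can be reproduced in a few lines. This is only a sketch in present-day Python, whose ints never overflow, so the helper `wrap_int32` is a hypothetical name that imitates an unchecked signed 32-bit integer:

```python
def wrap_int32(n):
    """Reduce n to a signed 32-bit two's-complement value (hypothetical helper)."""
    n &= 0xFFFFFFFF                      # keep only the low 32 bits
    if n >= 0x80000000:                  # high bit set: value is negative
        n -= 0x100000000
    return n

product = 2000000000 * 2                 # exact result: 4000000000
wrapped = wrap_int32(product)            # what an unchecked 32-bit int would hold
print(product, wrapped)
```

The exact product does not fit in 32 bits, so the wrapped value comes out negative with no error raised anywhere.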

As I've explained before, such encoding issues are silent by nature.
There's *nothing* you can ever do about it. The silent errors caused by
defaulting to utf-8 are far worse.
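
A small sketch of what "silent by nature" means, written in present-day Python 3 rather than the interpreter under discussion. The sample strings and the choice of KOI8-R are assumptions picked purely for illustration; the point is only that a decoder cannot know which encoding the bytes were meant to be in:

```python
# The same two bytes are valid under both conventions, with different meanings.
raw = "\u00e9".encode("utf-8")           # e-acute, stored as b'\xc3\xa9'
as_utf8 = raw.decode("utf-8")            # one character: e-acute
as_latin1 = raw.decode("latin-1")        # two characters: A-tilde, copyright sign

# Russian text stored as KOI8-R (an assumed example encoding), read as Latin-1:
# no exception, just the wrong characters.
koi = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8_r")
garbled = koi.decode("latin-1")

print(repr(as_utf8), repr(as_latin1), repr(garbled))
```

Neither decode call raises; the mismatch only becomes visible when a human looks at the output.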

>[Just, in a different message]
>> Of course it's not, and of course you shouldn't be counting votes. However,
>> the fact that more and more people chime in on the Latin-1 side (even
>> non-western oriented people like Ping and Moshe!) should ring a bell.
>
>Significantly, neither Ping nor Moshe cares for Latin-1 at all: they
>don't have a use for a default encoding.  This is because they have no
>hope that their preferred encoding would be elected as the default
>encoding.

Hm, Moshe wrote:
"""I much prefer the Fredrik-Paul position, known also as the
character is a character position, to the UTF-8 as default encoding.
Unicode is western-centered -- the first 256 characters are Latin 1.
"""

And Ping wrote:
"""If it turns out automatic conversions *are* absolutely necessary,
then i vote in favour of the simple, direct method promoted by Paul
and Fredrik: just copy the numerical values of the bytes.  The fact
that this happens to correspond to Latin-1 is not really the point;
the main reason is that it satisfies the Principle of Least Surprise.
"""

I thought that was pretty clear.
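
To make Ping's "just copy the numerical values of the bytes" concrete, here is a minimal sketch in present-day Python 3. The function name `widen_bytes` is made up for this example; it is not an API from any Python release:

```python
def widen_bytes(data):
    """Turn each byte value into the character with the same ordinal."""
    return "".join(chr(b) for b in data)

all_bytes = bytes(range(256))            # every possible 8-bit value
widened = widen_bytes(all_bytes)

# Copying ordinals gives exactly what the Latin-1 codec gives.
print(widened == all_bytes.decode("latin-1"))
# The ordinals survive unchanged, so the conversion is lossless.
print([ord(c) for c in widened] == list(all_bytes))
```

Both comparisons hold for all 256 byte values, which is why the byte-copy rule and "Latin-1 as default" describe the same behaviour.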

>Having no default encoding would be like having no automatic coercion
>between ints and long ints -- I tried this in very early Python
>versions (around 0.9.1 I believe) but Tim Peters and/or Steve Majewski
>quickly dissuaded me of this bad idea.

1. Currently utf-8 is the default. Many of us are trying to dissuade you of
this bad idea.
2. You propose to *not* provide a default encoding for characters >= 128.
3. Many of us are trying to dissuade you of this bad idea.

(Too bad none of us is called Tim or Steve, or you would've been convinced
a long time ago ;-)

One additional fact: 8-bit encodings exist that are not even compatible
with 7-bit ascii, making the choice to only compare if it's 7-bit ascii
look even more arbitrary.
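
One such family is EBCDIC. The sketch below uses Python 3's `cp037` codec as an example of my own choosing (the message above does not name a specific encoding) to show that plain letters need not live in the 7-bit range at all:

```python
text = "Python"

ascii_bytes = text.encode("ascii")       # every byte is below 128
ebcdic_bytes = text.encode("cp037")      # EBCDIC: letters sit at 128 and above

print(list(ascii_bytes))
print(list(ebcdic_bytes))

# Bytes that "look like 7-bit ascii" mean something else under EBCDIC.
misread = ascii_bytes.decode("cp037")
print(repr(misread))
```

So a test of the form "all bytes < 128, therefore ascii" rests on an assumption about the encoding too.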

Guido, maybe you'll believe it from your loving little brother: Guidos are
not always right ;-)

Just