[I18n-sig] Re: [Python-Dev] Unicode debate
Guido van Rossum
guido@python.org
Tue, 02 May 2000 10:53:19 -0400
> [GvR]
> >The big deal is that in some cultures, 8-bit strings with non-ASCII
> >bytes are unlikely to be Latin-1. Under the Latin-1 convention, they
> >would get garbage when mixing Unicode and regular strings.
[Just]
> They would also get garbage under the utf-8 convention, so again, a
> moot point.
No, because I changed my position! I now propose to make ASCII the
default conversion (i.e., characters must be in range(128) to avoid an
exception). You are arguing for Latin-1 which gives them silent
errors. I *was* arguing for UTF-8, which would give them likely but
not guaranteed errors.  I *am* now arguing for ASCII, which guarantees
them errors (if they are in fact using a non-ASCII encoding).
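The ASCII-default coercion being proposed here can be sketched in modern Python. This is an illustration, not the eventual implementation; the helper name `coerce_to_unicode` is invented for the example. The point is that mixing succeeds only when every byte is in range(128), and anything else raises instead of silently producing garbage:

```python
# Sketch of the proposed ASCII-only default conversion: decode byte
# strings as ASCII, so any byte >= 128 raises rather than being
# silently reinterpreted as Latin-1 (or UTF-8).

def coerce_to_unicode(raw: bytes) -> str:
    """Decode as ASCII; non-ASCII bytes raise instead of guessing."""
    return raw.decode("ascii")  # UnicodeDecodeError for bytes >= 128

print(coerce_to_unicode(b"hello"))     # pure ASCII: works fine
try:
    coerce_to_unicode(b"caf\xe9")      # 0xE9 is Latin-1 'e-acute'
except UnicodeDecodeError as e:
    print("rejected:", e.reason)       # guaranteed error, not garbage
```

This is essentially the behavior Python 2 ended up shipping with (a default encoding of `'ascii'` for implicit str/unicode coercion).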
> >This is
> >more like ignoring overflow on integer arithmetic (so that 2000000000*2
> >yields -294967296).  I am against silently allowing erroneous results
> >like this if I can help it.
>
> As I've explained before, such encoding issues are silent by nature.
> There's *nothing* you can ever do about it. The silent errors caused by
> defaulting utf-8 are far worse.
Which is why I no longer argue for it.
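The overflow analogy quoted above can be made concrete. A short sketch, assuming signed 32-bit two's-complement wraparound (the behavior of C-backed machine ints of the era); the helper `wrap_int32` is invented for the example:

```python
# Silent integer overflow: wrapping 2000000000*2 into a signed
# 32-bit int yields a negative number instead of raising an error,
# just as a silent default encoding yields garbage instead of raising.

def wrap_int32(n: int) -> int:
    """Interpret n modulo 2**32 as a signed 32-bit integer."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

print(wrap_int32(2000000000 * 2))  # -294967296, not 4000000000
```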
> Hm, Moshe wrote:
> """I much prefer the Fredrik-Paul position, known also as the
> character is a character position, to the UTF-8 as default encoding.
> Unicode is western-centered -- the first 256 characters are Latin 1.
> """
And then proceeded to write: "If I'm reading Hebrew from an ISO-8859-8
file, I'll set a conversion to Unicode on the fly anyway [...]".
> And Ping wrote:
> """If it turns out automatic conversions *are* absolutely necessary,
> then i vote in favour of the simple, direct method promoted by Paul
> and Fredrik: just copy the numerical values of the bytes. The fact
> that this happens to correspond to Latin-1 is not really the point;
> the main reason is that it satisfies the Principle of Least Surprise.
> """
>
> I thought that was pretty clear.
But he first proposed to have no conversions at all. I am now
convinced that UTF-8 is bad, and that having no default conversion at
all is bad. We need at least ASCII. I claim that we need no more
than ASCII. The reason is that Latin-1 is not a safe assumption;
ASCII is. (Unless it's not characters at all -- but usually binary
goop contains more than a smattering of bytes in range(128, 256) so it
would typically be caught right away.)
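The "caught right away" claim can be demonstrated directly. A minimal sketch, assuming modern Python's explicit `bytes.decode`; the sample bytes (a PNG signature prefix) are chosen only as a typical piece of binary goop:

```python
# Why binary goop is typically caught right away under an ASCII
# default: most non-text byte strings contain bytes in
# range(128, 256), so an ASCII decode fails instead of silently
# passing as if it were Latin-1 text.

binary_goop = bytes([0x89, 0x50, 0x4E, 0x47])  # start of a PNG header
try:
    binary_goop.decode("ascii")
except UnicodeDecodeError:
    print("caught: not character data")

# Under a Latin-1 default the same bytes would decode silently:
print(repr(binary_goop.decode("latin-1")))  # '\x89PNG' -- garbage as text
```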
> >Having no default encoding would be like having no automatic coercion
> >between ints and long ints -- I tried this in very early Python
> >versions (around 0.9.1 I believe) but Tim Peters and/or Steve Majewski
> >quickly dissuaded me from this bad idea.
>
> 1. Currently utf-8 is the default.  Many of us are trying to dissuade you
> from this bad idea.
I agree.
> 2. You propose to *not* provide a default encoding for characters >= 128
Correct.
> 3. Many of us are trying to dissuade you from this bad idea.
So far you're the only one -- I haven't seen other responses to this
idea yet.
> (Too bad none of us is called Tim or Steve, or you would've been convinced
> a long time ago ;-)
>
> One additional fact: 8-bit encodings exist that are not even compatible
> with 7-bit ASCII, which makes the choice to compare only when it's 7-bit
> ASCII look even more arbitrary.
But there's a compelling argument that *requires* ASCII (see previous
post), and encodings that are not a superset of ASCII are rare.
> Guido, maybe you'll believe it from your loving little brother: Guidos are
> not always right ;-)
But they listen to reason. I've been convinced that UTF-8 is bad.
I'm not convinced that Latin-1 is good, and I'm proposing what I think
is a very Pythonic compromise: ASCII, on which we (nearly) all can
agree.
> >[Just]
> >> You're going to have a hard time explaining that "\377" != u"\377".
> >
> [GvR]
> >I agree. You are an example of how hard it is to explain: you still
> >don't understand that for a person using CJK encodings this is in fact
> >the truth.
>
> That depends on the definition of truth: if you document that 8-bit strings
> are Latin-1, the above is the truth.  Conceptually classifying all other
> 8-bit encodings as binary goop makes the semantics crystal clear.
[and later]
> Oops, I meant of course that "\377" == u"\377" is then the truth...
I can document that 1==2 but that doesn't make it true.  Since we
can have binary goop in 8-bit strings, 8-bit strings are NOT always
Latin-1.  At least until Python 3000.
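Incidentally, the position defended here is exactly the rule today's Python enforces: a byte string and a text string are never implicitly equal, and b"\377" only equals u"\377" if you explicitly choose the Latin-1 decoding. A short sketch of the current behavior:

```python
# No implicit coercion between bytes and text: equality across the
# two types is simply False, never a silent Latin-1 reinterpretation.
assert b"\377" != "\377"

# Only an *explicit* Latin-1 decode makes the two compare equal:
assert b"\377".decode("latin-1") == "\377"

# And an ASCII decode refuses, since 0xFF is outside range(128):
try:
    b"\377".decode("ascii")
except UnicodeDecodeError:
    print("0xFF is not ASCII")
```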
Think about it once more. Why do you really want Latin-1?
--Guido van Rossum (home page: http://www.python.org/~guido/)