[I18n-sig] Unicode debate

Just van Rossum just@letterror.com
Tue, 2 May 2000 14:39:24 +0100


At 1:42 AM -0700 02-05-2000, Ka-Ping Yee wrote:
>If it turns out automatic conversions *are* absolutely necessary,
>then i vote in favour of the simple, direct method promoted by Paul
>and Fredrik: just copy the numerical values of the bytes.  The fact
>that this happens to correspond to Latin-1 is not really the point;
>the main reason is that it satisfies the Principle of Least Surprise.

Exactly.

I'm not sure if automatic conversions are absolutely necessary, but seeing
8-bit strings as Latin-1 encoded Unicode strings seems most natural to me.
Heck, even 8-bit strings should have an s.encode() method that would
behave *just* like u.encode(), and unicode(blah) could even *return* an
8-bit string if it turns out the string has no character codes > 255!
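
A minimal sketch of the "copy the numerical values of the bytes" rule, written
in present-day Python 3 (which postdates this message) purely for illustration.
The names widen() and narrow_if_possible() are hypothetical and not part of any
proposal; the second one only approximates the idea that unicode(blah) could
return an 8-bit string when nothing exceeds 255.

```python
def widen(data: bytes) -> str:
    # Copy each byte value to the code point with the same number.
    # This gives the same result as data.decode("latin-1").
    return "".join(chr(b) for b in data)


def narrow_if_possible(text: str):
    # Hypothetical helper: hand back an 8-bit string when every
    # character code fits in one byte, otherwise keep the wide string.
    if all(ord(ch) <= 255 for ch in text):
        return text.encode("latin-1")
    return text


raw = bytes(range(256))
# Copying byte values and decoding as Latin-1 agree for all 256 values.
assert widen(raw) == raw.decode("latin-1")
```

Because the mapping is one-to-one for values 0..255, the round trip between the
two forms loses nothing, which is the least-surprise property quoted above.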

Conceptually, this gets *very* close to the ideal of "there is only one
string type", and at the same time leaves room for 8-bit strings doubling
as byte arrays for backward compatibility reasons.

(Unicode strings and 8-bit strings could even be the same type, which only
uses wide chars when necessary!)
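
One way to picture a single type that only goes wide when it has to, again as
a rough Python 3 sketch under assumptions: the class name FlexString, its
is_wide property and the choice of array("L") for wide storage are all made up
for this illustration and do not describe any actual implementation.

```python
from array import array


class FlexString:
    """Hypothetical string type: one byte per character when possible."""

    def __init__(self, text: str):
        codes = [ord(ch) for ch in text]
        if all(c <= 0xFF for c in codes):
            # Every character code fits in 8 bits: narrow storage.
            self._store = bytes(codes)
        else:
            # At least one code is above 255: fall back to wide storage.
            self._store = array("L", codes)

    @property
    def is_wide(self) -> bool:
        return not isinstance(self._store, bytes)

    def __len__(self) -> int:
        return len(self._store)

    def __getitem__(self, index: int) -> str:
        # Both storage forms yield an integer code for an integer index.
        return chr(self._store[index])

    def __str__(self) -> str:
        return "".join(chr(c) for c in self._store)


narrow = FlexString("abc\xff")
wide = FlexString("a\u20ac")
assert not narrow.is_wide and wide.is_wide
```

Callers see the same interface either way; only the internal storage differs.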

Just