[I18n-sig] Re: [Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Mon, 01 May 2000 11:10:21 -0400


Sin Hang Kim writes:

> I am not quite follow on the discussion. But I am interested in Unicode-ify
> python:

Thanks for participating -- we need more input from people who will
actually use the non-Western Unicode planes!

> Python should be able to be an native language of any language. For given
> all  nations a fair ground for computer programming. The recently
> english-oriented python syntax should be easily ported to other languages
> and python programs written in all languages can be converted to another one
> automatically. i.e., a french speaking children can use french command words
> to write python code, and this python code can convert to English, Chinese,
> ...

I think that's a Python 3000 issue...  This would currently be very
difficult to add to the implementation.  Plus, I worry that it will
prevent free exchange of code across (language) borders: if you write
your program in "Chinese Python", most people in most other countries
won't be able to use it.  (This has been tried long ago for French
Pascal, and it wasn't a big success; my guess is for this reason.)

> Backward compatibility is a must. The current implementation of unicode
> string might break some code. The ability to convert from/to unicode is not
> enough. For example, it might for a search engine to collect many text from
> different encoding, and I have seen that mixed encoding in a single text. I
> did it once with in a Chinese application, I received a collective text file
> which someone who collect them from mainland China with GB encoding and
> locally with Big-5 encoding. The one who collect them do not read them
> carefully, and he got a mighty environment (richwin) which automatically
> recognize the encoding and adapt to it. So he just paste all these text
> together. With such an mixed text, no conversion to/from unicode handling is
> able to handle. Think if you run a mailing list, one like this, with people
> quoting each other's message and write in their native encoding, you will
> get a funny text collection with different encoding. This also can happen to
> the digest of such an mailing list: you may try now writing in all encoding
> :)

Of course, Unicode could also *help* -- all messages could be
translated from their original encoding to Unicode, and the digest
could be sent out in UTF-8.
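
A minimal sketch of that idea in present-day Python, assuming each
message arrives as raw bytes together with a known charset name. The
function name build_digest and the sample messages are hypothetical,
chosen only for illustration:

```python
def build_digest(messages):
    """Decode each message with its own charset, then emit one UTF-8 digest.

    `messages` is an iterable of (raw_bytes, charset_name) pairs; the
    charset is assumed to be known, e.g. from a mail header.
    """
    parts = [raw.decode(charset) for raw, charset in messages]
    return "\n".join(parts).encode("utf-8")


# Hypothetical sample data: the same two characters in GB and Big-5,
# plus a Latin-1 message.
messages = [
    ("\u4e2d\u6587".encode("gb2312"), "gb2312"),
    ("\u4e2d\u6587".encode("big5"), "big5"),
    ("bonjour".encode("latin-1"), "latin-1"),
]
digest = build_digest(messages)
```

Once every message has been decoded to Unicode, the original encodings
no longer matter; the digest is a single consistent UTF-8 byte string.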

> So, I prefer to have people choosing their encoding. Setting a flag inside a
> program will switch the internal handling of utf-8, 8-bit code. With time
> pass, we may drop that, but now, we can not abandon the 8-bit code.

Absolutely.  The problem you sketch (one file with multiple encodings)
can be handled by a Python program that takes control of the
encodings: for example, the program could read the file a line at a
time (or whatever unit is appropriate) and translate each line
according to the most appropriate encoding (as determined by context).
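
The line-at-a-time approach could be sketched roughly as follows, in
present-day Python. The names trial_decode and transcode_lines are
hypothetical, and trial decoding is only a stand-in for "determined by
context": GB and Big-5 byte ranges overlap, so a real program would need
a smarter chooser (surrounding lines, headers, statistics) to tell those
two apart.

```python
def trial_decode(raw, candidates=("ascii", "utf-8", "gb2312")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    This is a deliberately naive chooser; it cannot separate encodings
    whose byte ranges overlap, such as GB and Big-5.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep the line, replacing bytes that cannot be decoded.
    return raw.decode("utf-8", errors="replace"), "unknown"


def transcode_lines(raw_lines, choose=trial_decode):
    """Decode each line independently using the supplied chooser."""
    for raw in raw_lines:
        text, _encoding = choose(raw)
        yield text


# Hypothetical mixed-encoding input, one encoding per line.
mixed = [
    "hello\n".encode("ascii"),
    "\u4e2d\u6587\n".encode("utf-8"),
    "\u4e2d\u6587\n".encode("gb2312"),
]
unified = "".join(transcode_lines(mixed))
```

Passing the chooser as a parameter keeps the per-line loop separate from
the detection policy, so a better heuristic can be swapped in without
touching the rest of the program.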

--Guido van Rossum (home page: http://www.python.org/~guido/)