[I18n-sig] Random thoughts on Unicode and Python

Paul Prescod paulp@ActiveState.com
Sat, 10 Feb 2001 20:01:02 -0800


Andy Robinson wrote:
> 
> ...
> 
> That's my concern, and the thing I want to poll people on.
> If Python "just works" for these users, and if we already offer
> Unicode strings and a good codec library for people to use when they
> want to, is there really a need to go further?

Let me point out again that while I don't want to discount the needs of
these people, the fact is that over here in the West we need to use
Unicode ourselves! I've already figured out how Unicode works and
how it interacts with "ordinary strings" but I don't think that
everybody I hire to work at ActiveState should have to figure that out
themselves. Obviously the Unicode source file issue is separate, but
"Unicode as basic string literal" helps all of us.

In a year, a lot of my work will involve XML on a Unicode-enabled
operating system. I'll only have to think about 8-bit extended ASCII
because Python forces me to sometimes. Now I know most people are not
going to be moving to full Unicode as quickly as I am but that is the
future and we need to start laying the groundwork now.

>...
> (2) slightly corrupt data: Let's say you are dealing with files
> or database fields containing some truncated kanji.  If you
> use 8-bit-clean strings and no conversion, the data will not
> be corrupted or changed; if you try to magically convert
> it to Unicode you will get error messages or possibly even
> more corruption.

I think we've all agreed that Python should never, ever, magically
convert binary data to Unicode. I think that most people's fears about
Unicode are precisely that it will some day magically convert binary data
to Unicode. But we all agree that that should never happen.
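
To make the truncated-kanji case concrete, here is a minimal sketch in
present-day Python 3, which postdates this thread. The sample text, the
Shift-JIS encoding, and the helper name try_decode are assumptions chosen
for illustration; none of them come from the discussion. It shows that
bytes left alone pass through unchanged, while an explicit decode of the
damaged data reports an error instead of silently altering it.

```python
# Hypothetical sample: two kanji encoded as Shift-JIS (two bytes each),
# with the final byte cut off to simulate a truncated database field.
full = "\u6f22\u5b57".encode("shift_jis")
truncated = full[:-1]

# Treated as plain bytes, the data is copied without any change.
copied = bytes(truncated)


def try_decode(data, encoding="shift_jis"):
    """Decode explicitly; return (text, None) or (None, error)."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding only happens because the caller asked for it, and the
# truncated input fails loudly rather than being "fixed up".
text, error = try_decode(truncated)
```

The intact bytes decode normally with the same helper, so the failure is
specific to the damaged input and only appears when a conversion is
requested.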

Even in my original proposal when I said that the standard string should
be widened to Unicode, I never, ever, suggested that binary data should
be converted to Unicode. Rather I said that in some cases Unicode
characters could be a transport -- a representation layer -- for binary
data. Just as in some cases integers are a transport for characters or
(shudder) pointers.
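
One way to picture characters acting as a transport for binary data is a
mapping in which each byte value N is carried by the character with code
point N. The sketch below uses present-day Python 3 and the Latin-1 codec
to get that mapping; the choice of Latin-1 and the variable names are
assumptions for illustration, not something spelled out in this message.

```python
# Arbitrary binary payload covering every possible byte value.
payload = bytes(range(256))

# Carry each byte as the character with the same ordinal. Latin-1 maps
# byte N to code point N, so nothing is reinterpreted or lost.
carried = payload.decode("latin-1")

# Every character's ordinal equals the byte it stands for.
ordinals_match = all(ord(ch) == b for ch, b in zip(carried, payload))

# Unpacking the transport gives back exactly the original bytes.
restored = carried.encode("latin-1")
```

The data is never treated as text in some guessed encoding here; the
characters are just containers for the values 0 through 255, in the same
way an integer can hold a character code.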

> Suddenly upgrading to a new version of Python where all
> your data undergoes invisible transformations to Unicode
> and back is going to cause trouble for quite a few people.

But I do not believe that anyone has ever suggested that! I understand
where the misunderstanding comes from but it is nevertheless a
misunderstanding.

 Paul Prescod