[I18n-sig] Re: Python Character Model

Paul Prescod paulp@ActiveState.com
Wed, 07 Feb 2001 19:04:50 -0800


"Martin v. Loewis" wrote:
> 
> >
> > I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data
> > in a string literal. PythonWin and Tk expect Unicode. How could they
> > display the characters correctly?
> 
> No, PythonWin and Tk both tell apart Unicode and byte strings
> (although Tk uses quite a funny algorithm to do so). If they see a
> byte string, they convert it using the platform encoding (which is
> user-settable on both Windows and Unix) to a Unicode string, and
> display that.

And if they read in a file from a Frenchman, then they get random Russian
characters on their screen. Or they crash the third-party software
because it couldn't decode properly. Or ...
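
To make the failure concrete, here is a minimal sketch in present-day
Python 3 syntax (not the Python 1.5/2.0 under discussion). It assumes the
French author saved the text as Latin-1 and the reader's platform encoding
is KOI8-R; both encodings and the sample string are illustrative
assumptions only.

```python
# Text written by a French speaker and saved as Latin-1 bytes
# (the sample string and the choice of Latin-1 are assumptions).
original = "café déjà"
data = original.encode("latin-1")

# A reader whose platform encoding is KOI8-R decodes the very same bytes.
# No error is raised; the accented letters silently become Cyrillic ones.
misread = data.decode("koi8_r")

print(repr(original))
print(repr(misread))
```

Nothing in the byte string says which of the two readings was intended,
which is why the decoding step has to be stated by whoever knows the
source.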

This is what we need to move away from. The first step is to get people
to stop accidentally passing around character strings as byte strings. To
do that we need to make it as easy as possible to get properly decoded
strings into Python.

> > ...
> > What do you think *should* happen? These are the only choices I can
> > think of:
> >
> >  1. DOM encodes it as UTF-8
> >  2. DOM blindly passes it through and creates illegal XML
> >  3. (correct) User explicitly decodes data into Unicode charset.
> 
> What users expect to happen is 2; blindly pass-through. They think
> they can get it right; given enough control, this is feasible. It was
> even common practice in the absence of Unicode objects, so a lot of
> code depends on libraries passing things through as-is.

Surely you agree with me that it is inappropriate for a user to *expect*
a DOM implementation to pass on binary data unmolested. That some
particular DOM may do so (like minidom) is probably just a performance
optimization quirk that could go away at any time. Why would we go out
of our way to support people making this mistake?
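
As a rough illustration of choice 3, the following sketch uses present-day
Python 3 and xml.dom.minidom. The KOI8-R source, the sample word, and the
element name "greeting" are assumptions made up for the example; the point
is only that the caller decodes first and the DOM sees characters, never
raw bytes.

```python
from xml.dom.minidom import Document

# Bytes as they might arrive from a KOI8-R data source (assumed sample).
raw = "Привет".encode("koi8_r")

# The user states the encoding explicitly and hands the DOM real text.
text = raw.decode("koi8_r")

doc = Document()
root = doc.createElement("greeting")  # hypothetical element name
root.appendChild(doc.createTextNode(text))
doc.appendChild(root)

# The DOM can now serialize to whatever encoding is requested.
xml_bytes = doc.toxml(encoding="utf-8")
print(xml_bytes)
```

Because the text node holds characters, the serializer is free to emit
UTF-8 or any other encoding without guessing what the input bytes meant.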

> > If you don't know the encoding of your data sources then you should say
> > that explicitly in code rather than using the same functions as people
> > who *do* know what their encoding is. Explicit is better than implicit,
> > right? Our current default is totally implicit.
> 
> No, it's not. The current default is: always produce byte strings. 

A "byte string" is not something you'll find defined in the Python
tutorial, language reference or library reference. People who use open()
do not know that they are making a choice. If you ask a hundred Python
programmers whether the result of open() is a character stream or a byte
stream, most will say character stream. The same goes for string
literals.

The section of the Python language reference describing string literals
does not mention the word "byte" once. It mentions the word "character"
on almost every other line.

> In
> many applications, people certainly *should* use character strings,
> but they have to change their code for that. Telling everybody to use
> fopen for everything is wrong; telling them to use codecs.open for
> character streams is right.

In another message you admitted that the codec mechanism is somewhat
user-unfriendly, so I hope we agree that we need something better.
People need to start making a choice and we have to make that as easy
for them as possible!
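
One shape such an explicit choice can take, sketched in present-day
Python 3 rather than the Python of this thread: the caller either asks for
bytes or names an encoding and gets characters. The temporary file, the
sample word, and KOI8-R are assumptions for illustration.

```python
import os
import tempfile

sample = "Привет"  # assumed sample text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Write the sample as KOI8-R bytes, standing in for an external file.
    with open(path, "wb") as f:
        f.write(sample.encode("koi8_r"))

    # Choice 1: a byte stream. The caller gets raw bytes and knows it.
    with open(path, "rb") as f:
        raw_bytes = f.read()

    # Choice 2: a character stream. The caller names the encoding.
    with open(path, "r", encoding="koi8_r") as f:
        text = f.read()

print(type(raw_bytes).__name__, type(text).__name__)
```

Either way the decision is visible in the call itself instead of being
left to a platform default.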

 Paul Prescod