[I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model

Paul Prescod paulp@ActiveState.com
Wed, 07 Feb 2001 12:13:48 -0800


"Martin v. Loewis" wrote:
> 
> > ...
> > XXX is a series of non-ASCII bytes. Those are mapped into Unicode
> > characters with the same ordinals. Now you write them to a file. You
> > presumably do not specify an encoding on the file write operation. So
> > the characters get mapped back to bytes with the same ordinals. It all
> > behaves as it did in Python 1.0 ...
> 
> They don't write them to a file. Instead, they print them in the IDLE
> terminal, or display them in a Tk or PythonWin window. Both support
> arbitrarily many characters, and will treat the bytes as characters
> originating from Latin-1 (according to their ordinals).

I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data
in a string literal. PythonWin and Tk expect Unicode. How could they
display the characters correctly?

> Or, they pass them as attributes in a DOM method, which, on
> write-back, will encode every string as UTF-8 (as that is the default
> encoding of XML). Then the characters will get changed, when they
> shouldn't.

What do you think *should* happen? These are the only choices I can
think of:

 1. DOM encodes it as UTF-8
 2. DOM blindly passes it through and creates illegal XML
 3. (correct) User explicitly decodes the data into Unicode.

3) is unchanged today and under my proposal. You've got some bytes.
Python doesn't know what you mean. The only way to let it know what you
mean is to decode it.
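
As a sketch of choice 3 in today's Python terms (the KOI8-R sample bytes
here are illustrative, not from the thread):

```python
# KOI8-R bytes for the Russian word "da" -- illustrative sample data.
koi8_bytes = b"\xc4\xc1"

# Choice 3: the user explicitly decodes the bytes into Unicode first,
# because only the user knows the source encoding.
text = koi8_bytes.decode("koi8-r")

# Only now can a DOM layer safely serialize as UTF-8 (XML's default)
# without silently changing the characters.
utf8_bytes = text.encode("utf-8")
```

Skipping the decode step and handing the raw bytes to the DOM is what
produces choices 1 and 2.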

>...
> Legacy code will pass them to applications that know to operate with
> the full Unicode character set, e.g. by applying encodings where
> necessary, or selecting proper fonts (which might include applying
> encodings). *That* is where it will break, and the library has no way
> of telling whether the strings were meant as byte strings (in an
> unspecified character set), or as Unicode character strings.

The only sane thing to do when you don't know is to pass the characters
as-is, char->ord->char.
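
In modern Python terms that pass-through is exactly the Latin-1 round
trip; a minimal sketch:

```python
raw = bytes(range(256))                 # every possible byte value

# char -> ord -> char: each byte becomes the Unicode character with the
# same ordinal (this is precisely what Latin-1 decoding does) ...
chars = "".join(chr(b) for b in raw)

# ... and writing back without re-encoding maps each ordinal to a byte.
round_tripped = bytes(ord(c) for c in chars)

assert round_tripped == raw             # no byte was changed
```

Because ordinals 0-255 map one-to-one onto the first 256 Unicode code
points, nothing is lost in either direction.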

> > It isn't the appropriate time to create such a core code patch. I'm
> > trying to figure out our direction so that we can figure out what can be
> > done in the short term. The only two things I can think of are merge
> > chr/unichr (easy) and provide encoding-smart alternatives to open() and
> > read() (also easy). The encoding-smart alternatives should also be
> > documented as preferred replacements as soon as possible.
> 
> I'm not sure they are preferred. They are if you know the encoding of
> your data sources. If you don't, you better be safe than sorry.

If you don't know the encoding of your data sources then you should say
that explicitly in code rather than using the same functions as people
who *do* know what their encoding is. Explicit is better than implicit,
right? Our current default is totally implicit.
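
Python has since grown exactly this explicit spelling: the encoding
argument to open(). A sketch of the explicit style (the file name and
sample text are illustrative):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# The writer knows the encoding, so it is stated explicitly in code.
with io.open(path, "w", encoding="koi8-r") as f:
    f.write("\u0434\u0430")             # the Russian word "da"

# A reader who knows the encoding states it explicitly as well ...
with io.open(path, "r", encoding="koi8-r") as f:
    assert f.read() == "\u0434\u0430"

# ... while a reader who does not know it falls back to raw bytes,
# making the uncertainty visible in the code instead of implicit.
with io.open(path, "rb") as f:
    assert f.read() == b"\xc4\xc1"
```

Either way, the programmer's knowledge (or lack of it) about the
encoding shows up in the call, rather than being buried in a default.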

 Paul Prescod