PEP 263 comments

Fri Mar 1 02:29:43 EST 2002

"Stephen J. Turnbull" <stephen at xemacs.org> writes:

> IMO, the Python source code parser should never see any text data[1]
> that is not UTF-8 encoded.  If you want to submit Python programs to
> the parser that are not UTF-8 encoded, then it is your responsibility
> as the programmer to make sure they get translated into UTF-8 (eg, by
> the preprocessing hook) before the interpreter proper ever sees them.

That would surprise users. Suppose they have a source program that says

# -*- coding: koi8-r -*-
print "some cyrillic text"

This currently works fine on their system; the text comes out on the
terminal just right. With your proposal, Python would silently convert
this text to UTF-8 behind their backs, and the terminal would show
just garbage.
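
To make the failure concrete, here is a minimal sketch in current
Python (3.x), not the Python of this thread. The sample word and all
variable names are made up for illustration; the "terminal" is simulated
by decoding the bytes as KOI8-R.

```python
# Illustrative sketch: a KOI8-R terminal receiving UTF-8 bytes.
text = "\u043f\u0440\u0438\u0432\u0435\u0442"  # a sample Cyrillic word

koi8_bytes = text.encode("koi8-r")  # bytes as stored in the source file
utf8_bytes = text.encode("utf-8")   # bytes after a silent recoding

# The terminal interprets whatever bytes it receives as KOI8-R.
shown_today = koi8_bytes.decode("koi8-r")
shown_after_recoding = utf8_bytes.decode("koi8-r")

print(shown_today == text)           # True
print(shown_after_recoding == text)  # False
```

The recoded output is twice as long and consists of unrelated KOI8-R
characters, which is the garbage the user would see.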

In XEmacs, this is no problem: the "terminal" is mostly the *Messages*
buffer, which would know that all text is UTF-8.

For Unicode strings, we indeed plan to make the transformation you
suggest (not to UTF-8, though): If you have a script that reads

# -*- coding: koi8-r -*-
print u"some cyrillic text"

then the string literal will be converted to the internal Unicode
type. How to print it is then another issue; you'll have to figure out
the encoding of the terminal - that is feasible in most cases. 
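
As a rough sketch of those two steps in current Python (3.x): decode
the literal using the declared source encoding, then encode for the
terminal when printing. The helper name encode_for_terminal and the
fallback to ASCII are assumptions for illustration, not part of the
PEP.

```python
import sys

def encode_for_terminal(text, terminal_encoding=None):
    """Encode text for output (hypothetical helper).

    Falls back to sys.stdout.encoding, then to ASCII, and replaces
    characters the target encoding cannot represent.
    """
    encoding = (terminal_encoding
                or getattr(sys.stdout, "encoding", None)
                or "ascii")
    return text.encode(encoding, errors="replace")

# Step 1: bytes of the literal as stored in a koi8-r source file,
# converted to Unicode using the declared coding.
source_bytes = "\u0442\u0435\u043a\u0441\u0442".encode("koi8-r")
literal = source_bytes.decode("koi8-r")

# Step 2: the same Unicode string, encoded for different terminals.
print(encode_for_terminal(literal, "koi8-r") == source_bytes)  # True
print(encode_for_terminal(literal, "ascii"))                   # b'?????'
```

The point is that the Unicode value is independent of both the source
encoding and the terminal encoding; only the final step depends on the
terminal.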

It is not feasible to do the same for arbitrary byte strings: the
Python interpreter could not know whether a string is meant to be
UTF-8 encoded text, which needs conversion to the terminal's encoding,
or an arbitrary byte sequence, which is intended to appear on the
terminal as-is. "All byte strings are UTF-8" is not going to work,
since Python is used to operate on binary data as well, and the bytes
that make up a GIF file just aren't UTF-8.
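
A small illustration in current Python (3.x), with made-up bytes
standing in for GIF content. The function name looks_like_utf8 is
hypothetical. Note that a successful decode proves nothing about
intent, which is exactly the problem.

```python
def looks_like_utf8(data):
    """Return True if data happens to be well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Made-up bytes: a GIF magic number followed by arbitrary binary data.
gif_like = b"GIF89a" + bytes([0x80, 0x02, 0xFF, 0x00])

print(looks_like_utf8(gif_like))   # False: 0x80 cannot start a character
print(looks_like_utf8(b"GIF89a"))  # True, yet this is not meant as text
```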

> [1]  Ie, Python language or character text.  It might be convenient to
> have an octet-string primitive data type, in which you could put
> EUC-encoded Japanese or Java byte codes.  

The traditional "string" type is, in fact, a byte string type. Many
people still use it for character strings, since the Unicode type was
a later addition. Changing the string type to be a Unicode type was
not feasible, since that would have broken many applications, in
particular C modules that expect the internal representation of the
string type to be char[].

> [2]  But I recommend against this.  Don't offer support for such; it's
> a time and effort sink, for little return.

There will be a simple form of auto-recognition: a UTF-8 signature
(i.e. a UTF-8 encoded BOM) at the beginning of a source file will be
treated as a clear indication that the file is UTF-8.
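
The check itself is simple. A minimal sketch in current Python (3.x),
using the standard codecs.BOM_UTF8 constant; the two function names are
hypothetical, and this is not the parser's actual code.

```python
import codecs

def has_utf8_signature(data):
    """Return True if data starts with the UTF-8 encoded BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_signature(data):
    """Return data without a leading UTF-8 signature, if one is present."""
    if has_utf8_signature(data):
        return data[len(codecs.BOM_UTF8):]
    return data

with_signature = codecs.BOM_UTF8 + b"print('hello')\n"
print(has_utf8_signature(with_signature))       # True
print(has_utf8_signature(b"print('hello')\n"))  # False
```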

Regards,
Martin