Python's 8-bit cleanness deprecated?

SUZUKI Hisao suzuki611 at oki.com
Wed Feb 12 06:05:42 EST 2003


Kirill Simonov <kirill_simonov at mail.ru> wrote:
> I've inspected the current implementation. The file encoding does not
> affect ordinary string literals. At first the tokenizer converts them
> into UTF-8 from the file encoding. Then the compiler converts them back
> from UTF-8 to the file encoding. Thus the result is the same regardless
> of what encoding you use. The comments are tossed out by the tokenizer
> too. 

I'm sorry to say, but the fact is:  if UTF-8, Latin-1, or
*nothing* is specified, the compiler does not convert literals
at all.  Look at tok->encoding and tok->decoding_state in
tokenizer.c.

For good old scripts, no conversion happens.

> Why do you want them to be in any particular encoding if their
> encoding doesn't matter?

Some encodings cannot be parsed properly unless the parser
knows which one is in use.  For example, the Shift_JIS encoding
uses '\x5c' (the ASCII backslash) as the second byte of several
multi-byte characters.

This means that the current Python 2.2.* cannot parse Shift_JIS
properly.  Japanese users now write scripts in EUC-JP or UTF-8,
or use a specially hacked, Shift_JIS-centric version of Python.
Shift_JIS users will need some encoding declaration anyway if
they want to use a non-hacked version of Python 2.3.
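
To make the '\x5c' problem concrete, here is a minimal sketch.  It
assumes a present-day Python 3 interpreter with the standard
"shift_jis" and "euc_jp" codecs, not the 2.2/2.3 interpreters
discussed here, and the helper name trailing_backslash_chars is
made up for this illustration.  It lists the characters whose
encoded form ends in the backslash byte:

```python
# Illustration only: runs on a modern Python 3, not on Python 2.2/2.3.
BACKSLASH = 0x5C  # the byte a byte-oriented tokenizer treats as an escape


def trailing_backslash_chars(text, encoding="shift_jis"):
    """Return the characters of `text` whose 2-byte encoding ends in 0x5c."""
    hits = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) == 2 and raw[1] == BACKSLASH:
            hits.append(ch)
    return hits


# KATAKANA SO (U+30BD), the kanji U+8868, and HIRAGANA A (U+3042),
# written as escapes so this sketch itself stays pure ASCII.
sample = "\u30bd\u8868\u3042"

print(trailing_backslash_chars(sample))            # Shift_JIS: two hits
print(trailing_backslash_chars(sample, "euc_jp"))  # EUC-JP: no hits
print("\u8868".encode("shift_jis").hex())          # 955c
```

A tokenizer that scans bytes without knowing the encoding sees
the trailing 0x5c as the start of an escape sequence, or as an
escaped closing quote, inside a string literal.  In EUC-JP both
bytes of a two-byte character have the high bit set, so the
collision does not occur there.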

For EUC-JP and UTF-8 users in Japan, the situation is similar
to yours.

> And I can propose a perfect solution. If there are no defined encoding
> for a source file, assume that it uses a simple 8-bit encoding. Do not
> convert the file into UTF-8 in the tokenizer. And do not convert string
> literals in the compiler. Raise SyntaxError if a non-ASCII character is
> contained in a Unicode literal. We will even save a few CPU cycles
> for most Python source files using this approach.

> I will write a patch if you agree with this solution.

I like your solution.  Writing a patch will be easy since the
conversion skipping is already implemented ;-).  In fact, the
prototype of the PEP 263 implementation, which I posted to
sf.net last spring, was implemented as follows:

  This implementation behaves just as the normal Python 2.2.1c1 does
  if no other coding hints are given.  Thus it does not hurt anyone
  who gets his/her jobs done with Python now.  Note that it is
  strictly compatible with the PEP in that every program valid in the
  PEP is also valid in this implementation.

It had no warnings.  Now I think it may be best to warn only
about Unicode literals that contain non-ASCII characters.
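
As a rough sketch of what such a check looks for (this is not the
patch, and not how the compiler would do it), the following
stand-alone scan flags u"..." literals that contain non-ASCII
characters and leaves ordinary literals alone.  It assumes a
present-day Python 3 and its standard tokenize module; the
function name non_ascii_unicode_literals is hypothetical.

```python
import io
import tokenize


def non_ascii_unicode_literals(source):
    """Return (line_number, literal) pairs for u'...' literals that
    contain at least one non-ASCII character."""
    found = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type != tokenize.STRING:
            continue
        is_unicode_literal = tok.string[0] in "uU"
        has_non_ascii = any(ord(c) > 127 for c in tok.string)
        if is_unicode_literal and has_non_ascii:
            found.append((tok.start[0], tok.string))
    return found


example = (
    'a = u"caf\u00e9"\n'   # Unicode literal with a non-ASCII char: flagged
    'b = "caf\u00e9"\n'    # ordinary literal: left alone
    'c = u"cafe"\n'        # pure ASCII Unicode literal: left alone
)
for lineno, literal in non_ascii_unicode_literals(example):
    print("line %d: non-ASCII character in %s" % (lineno, literal))
```

Only the first line of the example is reported; 8-bit data in
ordinary string literals passes through untouched.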

-- SUZUKI Hisao




