Python's 8-bit cleanness deprecated?

M.-A. Lemburg mal at lemburg.com
Sat Feb 8 09:10:57 EST 2003


Kirill Simonov wrote:
> * M.-A. Lemburg <mal at lemburg.com>:
> 
>>No, but they'll need to pay some lucky Python programmer to
>>get rid of the warning :-) Seriously, the warning and the trouble
>>are intended as I already mentioned in the bug report Kirill
>>filed on SF: http://www.python.org/sf/681960/ :
> 
> Sorry, but I'm not convinced. I hope you still have patience to 
> hear my objections.

Sure :-)

> I've inspected the current implementation. The file encoding does not
> affect ordinary string literals. At first the tokenizer converts them
> into UTF-8 from the file encoding. Then the compiler converts them back
> from UTF-8 to the file encoding.

The story goes like this:

binary file content using encoding ENC
-> via codec for ENC into Unicode
-> via UTF-8 codec into UTF-8 string
-> tokenizer
-> compiler
for 8-bit string literals in the source code
-> UTF-8 string is converted back into encoding ENC

Provided that the encoding ENC roundtrips safely
for all 256 byte values, 8-bit strings will come out
unchanged in the compiled byte code.
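The pipeline above can be sketched in a few lines of (modern) Python. This is an illustration, not the interpreter's actual code; "latin-1" stands in for the declared encoding ENC, and the literal is an arbitrary example:

```python
# Sketch of the source-decoding pipeline: ENC -> Unicode -> UTF-8
# for the tokenizer, then back to ENC for 8-bit string literals.
raw = b"s = '\xe9'\n"                    # source bytes in encoding ENC

as_unicode = raw.decode("latin-1")       # ENC -> Unicode
as_utf8 = as_unicode.encode("utf-8")     # Unicode -> UTF-8 (tokenizer input)

# For 8-bit string literals, the compiler converts back to ENC:
restored = as_utf8.decode("utf-8").encode("latin-1")

assert restored == raw   # roundtrip-safe: the bytes survive unchanged
```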

> Thus the result is the same regardless
> of what encoding you use. The comments are tossed out by the tokenizer
> too. Why do you want them to be in any particular encoding if their
> encoding doesn't matter?

The encoding matters because the complete file is passed to
the codec. Thus, comments in different encodings are likely going
to cause decoding errors.
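A small illustration of such a failure (the comment bytes are a made-up example): a Latin-1 comment decodes fine under Latin-1, but the same bytes are malformed if the file is declared as UTF-8:

```python
# Byte 0xE9 is 'é' in Latin-1, but an invalid sequence in UTF-8.
comment = b"# caf\xe9\n"

ok = comment.decode("latin-1")           # succeeds
try:
    comment.decode("utf-8")              # declared encoding mismatch
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)
```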

> Well, I understand. The file encoding is defined for the whole file.
> So comments and string literals must be in this encoding too.
> And that way we can define Unicode literals using our favourite encoding.

Right.

> But what is the price that we pay for this? The millions of Python
> scripts that use 8-bit string literals or comments are broken now in
> order to allow the feature that no one ever used! I think that this is
> an extreme.

They are not broken. In Python 2.3 you get a warning telling
you to add the encoding header, later you'll get a SyntaxError.

I think it's in the spirit of Python to be explicit about
what you are doing. Up 'til now, Python did not officially
support non-ASCII text in Python source code. Unfortunately,
it didn't check this either. As a result we now have scripts
using all sorts of encodings. The PEP and its implementation
(by Martin von Loewis) are aimed at steering this situation into
safe grounds and making the use of non-ASCII source official
with the use of the encoding header.

While I agree that this will cause some work, I don't think
that adding the encoding header to source files is all that
hard to do. Making things explicit will save you much more time
in the future than you have to spend now to fix the situation.

> And I can propose a perfect solution. If there are no defined encoding
> for a source file, assume that it uses a simple 8-bit encoding. Do not
> convert the file into UTF-8 in the tokenizer. And do not convert string
> literals in the compiler. Raise SyntaxError if a non-ASCII character is
> contained in a Unicode literal. We will even save a few CPU cycles
> for most Python source files using this approach.

The compiler is slow anyway, so the few cycles you save
here wouldn't be noticed down the road :-)

As I mentioned in a previous mail, you can have the same
situation by specifying that your source code is using
Latin-1 as encoding. This will never fail due to an
encoding error, because Latin-1 is a subset of Unicode
and you could even use multiple encodings in a single
file (which I strongly recommend *not* to do).
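The reason Latin-1 can never fail to decode is easy to verify: it maps each of the 256 byte values one-to-one onto the first 256 Unicode code points, so every byte sequence decodes, and re-encoding gives the original bytes back:

```python
# Latin-1 is lossless over all 256 byte values.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```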

However, I think that specifying the true encoding
(and possibly fixing the cases where the encoding is
being misused) will go much further.

> I will write a patch if you agree with this solution.

I am not convinced yet. So far we have only discussed that
the PEP implementation will cause a little work to make
things explicit that were previously implicitly tolerated,
but never officially allowed.

We are not talking about complex code analysis here. The
required change involves one line at a very well defined
location (first line of the file or second line if a
shebang is used).
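As a rough sketch of how little is involved, the declaration is found by scanning only the first two lines for a "coding" cookie; the regex below follows the pattern given in PEP 263, and the sample source is an invented example:

```python
import re

# PEP 263 recognizes the encoding declaration with a pattern like
# this; it must appear on line 1, or on line 2 if line 1 is a shebang.
COOKIE = re.compile(r"coding[:=]\s*([-\w.]+)")

source = "#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nprint('hello')\n"

declared = None
for line in source.splitlines()[:2]:     # only the first two lines count
    m = COOKIE.search(line)
    if m:
        declared = m.group(1)
        break

print("declared encoding:", declared)
```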

>>This whole thing is one more step in the direction of
>>explicit is better than implicit and opens up Python
>>for many more languages such as, for example, Asian
>>scripts.
> 
> If you need a pythonic quote, it is here
>     "Practicality beats purity"

That's not listed in the Zen of Python
(http://www.python.org/doc/Humor.html).
"Explicit is etter than implicit" is and that's what the PEP
is all about.

Seriously, I believe that another 10 years down the road
you'll thank us for using the "Now is better than never."
Zen phrase.

BTW, in a previous posting I was referring to using a UTF-16
BOM mark to avoid the encoding header -- that should have
been a UTF-8 BOM mark. The support for UTF-16 BOM marks is
there, but it is currently disabled.
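For illustration, handling a UTF-8 BOM (really a signature, since UTF-8 has no byte order) amounts to stripping the three marker bytes before decoding; the sample data here is made up:

```python
import codecs

# A UTF-8 signature at the start of a file can mark the encoding
# without a coding comment.
data = codecs.BOM_UTF8 + "s = 'abc'".encode("utf-8")

if data.startswith(codecs.BOM_UTF8):
    text = data[len(codecs.BOM_UTF8):].decode("utf-8")
```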

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/

More information about the Python-list mailing list