[Python-Dev] PEP 263 -- Python Source Code Encoding

26 Feb 2002 22:36:58 +0100

Guido van Rossum <guido@python.org> writes:

> > This makes Latin-1 the right choice:
> > 
> > * Unicode literals already use it today
> 
> But they shouldn't, IMO.

I agree. I recommend deprecating this feature and raising a
DeprecationWarning if a Unicode literal contains non-ASCII characters
but no encoding has been declared.
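
As a rough illustration of that check (a sketch only, not the proposed
tokenizer change): the helper names declared_encoding and check_source
are made up for this example, the pattern is modeled on the magic-comment
pattern given in PEP 263, and for simplicity the check looks at every
byte in the file rather than only at the bytes inside Unicode literals.
It is written in present-day Python, with bytes standing in for the raw
source.

```python
import re
import warnings

# Magic-comment pattern, modeled on the one given in PEP 263.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source: bytes):
    """Return the encoding named in the first two lines, or None."""
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


def check_source(source: bytes) -> bool:
    """Warn and return True if non-ASCII bytes appear with no declaration.

    Simplification: a real check would look only inside Unicode literals.
    """
    has_non_ascii = any(byte > 0x7F for byte in source)
    if has_non_ascii and declared_encoding(source) is None:
        warnings.warn(
            "non-ASCII bytes in source but no encoding declared",
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    return False
```

With this, a file consisting of s = u'<0xE9>' and no magic comment would
trigger the warning, while the same file with "# coding: latin-1" on the
first line would not.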

> Sorry, I don't understand what you're trying to say here.  Can you
> explain this with an example?  Why can't we require any program
> encoded in more than pure ASCII to have an encoding magic comment?  I
> guess I don't understand what you mean by "raw binary".

With the proposed implementation, the encoding declaration is only
used for Unicode literals. In all other places where non-ASCII
characters can occur (comments, string literals), those characters are
treated as "bytes", i.e. it is not verified that these bytes are
meaningful under the declared encoding.
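
To make the distinction concrete, here is a small model of that
behaviour. It is only a sketch: phase1_literal is a hypothetical helper,
not part of any implementation, and it uses present-day Python, where
bytes plays the role of the plain string type.

```python
def phase1_literal(raw: bytes, is_unicode: bool, declared: str):
    """Hypothetical model of how the bytes inside a literal are treated.

    raw        -- the bytes between the quotes in the source file
    is_unicode -- True for u"..." literals, False for plain "..." literals
    declared   -- the encoding named in the magic comment
    """
    if is_unicode:
        # Unicode literal: decoded with the declared encoding, so
        # invalid byte sequences raise UnicodeDecodeError.
        return raw.decode(declared)
    # Plain string literal: the bytes pass through unverified.
    return raw


raw = b"caf\xe9"  # 0xE9 is e-acute in Latin-1, but not valid UTF-8 here

as_plain = phase1_literal(raw, False, "utf-8")     # accepted, never checked
as_unicode = phase1_literal(raw, True, "latin-1")  # decoded to four characters
```

Note that the plain literal is accepted even though the file claims to be
UTF-8 and the bytes are not valid UTF-8; passing the same bytes as a
Unicode literal under "utf-8" would raise UnicodeDecodeError.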

Marc's original proposal was to apply the declared encoding to the
complete source code, but I objected, claiming that it would make the
tokenizer changes more complex and the resulting tokenizer likely
significantly slower (at least if you use the codecs API to perform the
decoding).

In phase 2, the encoding will apply to all strings. So it will no
longer be possible to put arbitrary byte sequences in a string literal,
at least if the declared encoding disallows certain byte sequences (as
UTF-8 and ASCII do). Since this is currently possible, we have a
backwards compatibility problem.
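
A quick way to see which encodings cause trouble is to try decoding an
arbitrary byte string under each of them. This is a sketch in
present-day Python; allowed_in_phase2 is a made-up name, and it assumes
phase 2 would simply reject any literal whose bytes fail to decode.

```python
def allowed_in_phase2(raw: bytes, declared: str) -> bool:
    """Return True if raw is a valid byte sequence under the declared encoding."""
    try:
        raw.decode(declared)
    except UnicodeDecodeError:
        return False
    return True


# Binary data of the kind people embed in string literals today.
blob = b"\xff\xfe\x00\x01"

results = {
    name: allowed_in_phase2(blob, name)
    for name in ("latin-1", "utf-8", "ascii")
}
```

Latin-1 maps every byte value to a character, so any byte string passes;
UTF-8 and ASCII both reject the 0xFF byte, so a literal holding this data
would stop working in a file declared with either of them.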

Regards,
Martin