[Python-Dev] PEP 263 -- Python Source Code Encoding

Martin v. Loewis martin@v.loewis.de
02 Mar 2002 10:05:28 +0100


"Jason Orendorff" <jason@jorendorff.com> writes:

> I gather that "coding:" is supposed to specify the
> encoding (what MIME calls "charset") of the file.
> But under PEP 263, it only refers to the Unicode string
> literals within the program.  Everything else must still
> be treated as 8-bit text.

Not really. If you are willing to separate the language from its
implementation, then I'd phrase the intent this way:
- if an encoding is declared, all of the file must follow that
  encoding (all files, always (*))
- in phase 1, the implementation will not verify that property,
  except for Unicode literals
- in phase 2, the Python implementation will implement the language
  completely in this respect.
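
As a rough illustration of what "an encoding is declared" means, here is
a minimal sketch in present-day Python 3. It assumes the declaration is
a comment on the first or second line matching the pattern PEP 263
describes; the helper name declared_encoding is hypothetical and this is
not how the interpreter itself does it.

```python
import re

# Pattern described in PEP 263 for the declaration inside a comment line.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def declared_encoding(source_bytes):
    """Return the encoding named on line 1 or 2, or None (hypothetical helper)."""
    for line in source_bytes.splitlines()[:2]:
        # Only comment lines can carry the declaration.
        if line.lstrip().startswith(b"#"):
            match = CODING_RE.search(line)
            if match:
                return match.group(1).decode("ascii")
    return None


sample = b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\ns = 'x'\n"
found = declared_encoding(sample)
```

A declaration that first appears on the third line or later is ignored
by this sketch.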

> For example, I'm not sure what effect "coding: utf-16"
> would have.  (?)

Invalid; source encodings must be an ASCII superset. I'm not sure how
the implementation will react to that: if the file really is UTF-16,
you'll get a syntax error; if you say it is UTF-16 but it isn't,
Python will reject it in phase 2.
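
To make "ASCII superset" concrete, here is a small sketch in present-day
Python 3. The helper name is_ascii_superset is hypothetical, and the
check only looks at the encoding direction (every ASCII character must
encode to its own single byte), which is an assumption about what the
term should mean here.

```python
def is_ascii_superset(encoding_name):
    """True if every ASCII character encodes to its own single byte."""
    try:
        return all(
            chr(code).encode(encoding_name) == bytes([code])
            for code in range(128)
        )
    except UnicodeEncodeError:
        # The codec cannot represent some ASCII character at all.
        return False


results = {
    name: is_ascii_superset(name)
    for name in ("utf-8", "latin-1", "utf-16")
}
```

UTF-16 fails this check because it spends two bytes per ASCII character
(plus a byte order mark), so the "coding:" comment itself could not be
read as ASCII.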

> For another example, if you have UTF-8 Unicode string
> literals in your program but you also have 8-bit
> Latin-1 plain str string literals in the same program,
> how should you mark it?  

You should mark the file as UTF-8. In phase 2, Python will reject it.
At that point, you should convert your Latin-1 string literal into
hex escapes - it is binary data then, not Latin-1.
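
A short sketch of why such a file would be rejected, and what the hex
escapes look like, written in present-day Python 3. The helper
to_hex_escapes is hypothetical and the sample word is only an
illustration.

```python
def to_hex_escapes(raw_bytes):
    """Spell raw bytes as \\xNN escapes (hypothetical helper)."""
    return "".join("\\x%02x" % byte for byte in raw_bytes)


# "caf" followed by e-acute, stored as Latin-1: the last byte is 0xE9.
latin1_bytes = "caf\u00e9".encode("latin-1")

# The same bytes are not well-formed UTF-8, so a file declared as UTF-8
# that contains them cannot be decoded as a whole.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Pure-ASCII spelling of the offending byte for use in a str literal.
escaped = to_hex_escapes(latin1_bytes[-1:])
```

After the conversion the literal contains only ASCII characters, so it
is valid under any ASCII-superset encoding.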

> How will Emacs then treat it?

Don't know - just try. You cannot create such a file with Emacs.

> Is a Python program an 8-bit string or a Unicode string?

From the viewpoint of the language definition, it is a character
string. Quoting the C++ standard "how source files are mapped to the
source character set is implementation-defined".

Python (the language definition) actually does define it, by means of
PEP 263 (**). The source character set is Unicode, which does not
necessarily mean implementations have to represent source as Unicode
strings internally - they could also use the on-disk representation,
as long as the implementation behaves "as-if" it did perform the
mapping to Unicode.
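
The distinction between the on-disk representation and the source
character set can be sketched in present-day Python 3 as follows. The
variable names are illustrative only; the point is that two different
byte sequences map to the same character string once the declared
encoding is applied.

```python
# The characters the parser should see: a Unicode literal holding a-umlaut.
text = "s = u'\u00e4'"

# Two possible on-disk representations of that same text.
on_disk_latin1 = text.encode("latin-1")
on_disk_utf8 = text.encode("utf-8")

# The bytes differ ...
bytes_differ = on_disk_latin1 != on_disk_utf8

# ... but decoding each with its own encoding yields the same characters.
same_text = (
    on_disk_latin1.decode("latin-1") == on_disk_utf8.decode("utf-8") == text
)
```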

> Right now, although perhaps someone who knows more about
> the parser than I can expand on this, it seems that
> Python programs are 8-bit strings.  

That's correct, although the language definition explicitly says that
usage of bytes above 127 is undefined. So Python programs, from the
point of view of the language definition, are ASCII strings.

> Therefore I argue that it makes no sense to use "coding:" to label a
> Python file, because the file doesn't consist of Unicode text.

You need to distinguish between the file on disk and the text
processed by the parser (something that the current parser doesn't do,
except for line endings). This PEP proposes to change how it is
currently done. If there was no change, it would not be a "Python
Enhancement Proposal".

Regards,
Martin

(*) If no encoding is declared, files must follow the system encoding.
(**) The list of accepted source encodings remains
implementation-defined; each Python release should spell out its list
of supported encodings.