PEP: Defining Python Source Code Encodings

Tue Jul 17 07:08:02 EDT 2001

On Tue, 17 Jul 2001, M.-A. Lemburg wrote:

> After having been through two rounds of comments with the "Unicode
> Literal Encoding" pre-PEP, it has turned out that people actually
> prefer to go for the full Monty meaning that the PEP should handle
> the complete Python source code encoding and not just the encoding
> of the Unicode literals (which are currently the only parts in a
> Python source code file for which Python assumes a fixed encoding).
> 
> Here's a summary of what I've learned from the comments:
> 
> 1. The complete Python source file should use a single encoding.

Yes, certainly.

> 2. Handling of escape sequences should continue to work as it does 
>    now, but with all possible source code encodings, that is
>    standard string literals (both 8-bit and Unicode) are subject to 
>    escape sequence expansion while raw string literals only expand
>    a very small subset of escape sequences.
> 
> 3. Python's tokenizer/compiler combo will need to be updated to
>    work as follows:
> 
>    1. read the file
>    2. decode it into Unicode assuming a fixed per-file encoding
>    3. tokenize the Unicode content
>    4. compile it, creating Unicode objects from the given Unicode data
>       and creating string objects from the Unicode literal data
>       by first reencoding the Unicode data into 8-bit string data
>       using the given file encoding
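
A rough sketch of the four quoted steps in present-day Python 3, only to make the pipeline concrete. The helper name `compile_with_encoding` and the sample source are invented for illustration; nothing here is taken from the proposal itself.

```python
import io
import tokenize


def compile_with_encoding(raw, encoding="latin-1", filename="<example>"):
    """Hypothetical helper: decode, tokenize, and compile raw source bytes."""
    # Step 2: decode the bytes into Unicode using one per-file encoding.
    text = raw.decode(encoding)
    # Step 3: tokenize the Unicode content.
    tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))
    # Step 4: compile the Unicode source.
    code = compile(text, filename, "exec")
    return tokens, code


# Step 1 stand-in: these bytes play the role of a file read from disk.
raw = 'greeting = "caf\u00e9"\n'.encode("latin-1")
tokens, code = compile_with_encoding(raw, "latin-1")

namespace = {}
exec(code, namespace)

# An 8-bit string is recovered by re-encoding with the file's encoding.
eight_bit = namespace["greeting"].encode("latin-1")
```

Here `eight_bit` holds the same bytes that appeared between the quotes in the original file, which is the re-encoding the last quoted step describes.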

I think that if no encoding is given, it must silently assume an "UNKNOWN"
encoding and do nothing, that is, stay 8-bit clean (as it is now).

Otherwise, it will slow down the parser considerably.

I also think that if an encoding is chosen, there is no need to reencode
string literals back into 8-bit data: let them stay in Unicode.

Or the encoding must _always_ be ASCII+something, such as utf-8. That would
eliminate the need to touch the tokenizer, because only docstrings,
comments and string literals are entities which require encoding /
decoding.
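
To illustrate the "ASCII+something" point, here is a small sketch in present-day Python 3 (the helper name `non_ascii_stays_high` is made up). In UTF-8 every byte of a non-ASCII character has the high bit set, so ASCII delimiters such as quotes can be located at the byte level without decoding first.

```python
def non_ascii_stays_high(text, encoding):
    """True if every non-ASCII character encodes only to bytes >= 0x80."""
    return all(
        byte >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for byte in ch.encode(encoding)
    )


source = 's = "Привет"  # комментарий\n'
utf8_bytes = source.encode("utf-8")

# Only the two real quote characters show up as quote bytes.
quote_count = utf8_bytes.count(b'"')
utf8_ok = non_ascii_stays_high(source, "utf-8")

# UTF-16 is not ASCII-compatible: its code units contain low bytes.
utf16_ok = non_ascii_stays_high(source, "utf-16-le")
```

The UTF-16 check is there only as a counter-example of an encoding that a byte-oriented tokenizer could not handle this way.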

If I understood correctly, Python will soon switch to "unicode-only"
strings, as Java and Tcl did. (This is of course a disaster for some Python
usage areas such as fast text-processing, but...)

Or am I missing something?

>    To make this backwards compatible, the implementation would have to
>    assume Latin-1 as the original file encoding if not given (otherwise,
>    binary data currently stored in 8-bit strings wouldn't make the
>    roundtrip).

...as I said, there must be no assumed charset. Things must
be left as they are now when no explicit encoding is given.
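
For reference, the roundtrip property that the quoted paragraph relies on can be checked in present-day Python 3 (a sketch, not part of the proposal). Latin-1 maps each of the 256 byte values to the code point with the same number, so decoding and re-encoding loses nothing, whereas a stricter codec such as UTF-8 rejects arbitrary binary data.

```python
all_bytes = bytes(range(256))  # every possible 8-bit value

# Latin-1 maps byte N to code point N, so the roundtrip is lossless.
as_text = all_bytes.decode("latin-1")
roundtrip_ok = as_text.encode("latin-1") == all_bytes

# UTF-8, by contrast, rejects arbitrary binary data.
try:
    all_bytes.decode("utf-8")
    utf8_accepts_binary = True
except UnicodeDecodeError:
    utf8_accepts_binary = False
```

This shows only that the Latin-1 default preserves data; it does not address the cost of the extra decode and encode pass.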

> 4. The encoding used in a Python source file should be easily
>    parseable for an editor; a magic comment at the top of the
>    file seems to be what people want to see, so I'll drop the
>    directive (PEP 244) requirement in the PEP.
> 
> Issues that still need to be resolved:
> 
> - how to enable embedding of differently encoded data in Python
>   source code (e.g. UTF-8 encoded XML data in a Latin-1
>   source file)

Probably by adding explicit conversions.
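
A sketch of what such an explicit conversion might look like in present-day Python 3. The sample XML and variable names are invented; the literal stands for text already decoded from a Latin-1 source file.

```python
# Text as it would exist after decoding a Latin-1 source file.
xml_text = '<?xml version="1.0" encoding="utf-8"?><name>Jos\u00e9</name>'

# Explicit conversion to the encoding the embedded data is supposed to use.
xml_utf8 = xml_text.encode("utf-8")

# The same text in the source file's own encoding, for comparison.
xml_latin1 = xml_text.encode("latin-1")
```

The non-ASCII character takes two bytes in `xml_utf8` and one byte in `xml_latin1`, so the two byte strings differ even though they come from the same literal.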

> - what to do with non-literal data in the source file, e.g.
>   variable names and comments:
> 
>   * reencode them just as would be done for literals
>   * only allow ASCII for certain elements like variable names
>   etc.

I think non-literal data must be in ASCII.
But it could be too cheesy to have variable names in a national
alphabet ;-)

> - which format to use for the magic comment, e.g.
> 
>   * Emacs style:
> 
>       #!/usr/bin/python
>       # -*- encoding = 'utf-8' -*-
> 
>   * Via meta-option to the interpreter:
> 
>       #!/usr/bin/python --encoding=utf-8
> 
>   * Using a special comment format:
> 
>       #!/usr/bin/python
>       #!encoding = 'utf-8'

No variant is ideal. The 2nd is the worst or the best of all
(it depends on how you look at it!)
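
Whichever format wins, an editor or tool would need to find the declaration cheaply. A minimal sketch in present-day Python 3 that recognizes all three quoted variants; the pattern and the function name `find_declared_encoding` are assumptions made for illustration, not a proposed syntax.

```python
import re

# Hypothetical pattern: matches "encoding = 'utf-8'" and "--encoding=utf-8".
ENCODING_RE = re.compile(r"encoding\s*=\s*['\"]?([-\w.]+)['\"]?")


def find_declared_encoding(source_lines, default=None):
    """Return the encoding named in the first two lines, or `default`."""
    for line in source_lines[:2]:
        if not line.startswith("#"):
            continue  # only comment lines may carry the declaration
        match = ENCODING_RE.search(line)
        if match:
            return match.group(1)
    return default


emacs_style = ["#!/usr/bin/python", "# -*- encoding = 'utf-8' -*-"]
meta_option = ["#!/usr/bin/python --encoding=utf-8"]
special_comment = ["#!/usr/bin/python", "#!encoding = 'utf-8'"]
undeclared = ["x = 1"]

found = [find_declared_encoding(lines)
         for lines in (emacs_style, meta_option, special_comment, undeclared)]
```

Returning `None` for an undeclared file leaves the choice of a default (or of no default at all) to the caller.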

Python has no macro directives. In this situation 
they could help greatly!

That "#!encoding" is a special case of a macro directive.

Maybe just put something like "# <!DOCTYPE HTML PUBLIC"
at the beginning...

Or, an even greater idea occurred to me: allow some XML
with meta-information (not only the encoding), somehow escaped.

I think GvR could come up with some advice here...

> Comments are welcome !

Sincerely yours, Roman A.Suzi
-- 
 - Petrozavodsk - Karelia - Russia - mailto:rnd at onego.ru -