PEP 263 comments

Mon Feb 25 00:20:20 EST 2002

To make some progress on PEP 263, I suggest that some of the open issues
be resolved as follows:

- Comment syntax: I suggest using the form
  -*- coding: <coding name> -*-
  Emacs already recognizes this syntax, as does patch #508973 
  on IDLEfork. The other proposed syntaxes should be removed from the
  PEP.
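
  A minimal sketch of how such a comment could be recognized. The
  pattern, the helper name find_coding_comment, and the restriction to
  the first two lines are assumptions made for illustration; they are
  not taken from the PEP or from the IDLEfork patch.

```python
import re

# Matches only the Emacs-style form suggested above: -*- coding: <name> -*-
# (hypothetical pattern, for illustration only).
CODING_RE = re.compile(rb"-\*-\s*coding:\s*([-\w.]+)\s*-\*-")


def find_coding_comment(source, max_lines=2):
    """Return the declared coding name, or None if no comment is found.

    `source` is the raw bytes of the file. Looking only at the first
    `max_lines` lines is an assumption of this sketch.
    """
    for line in source.splitlines()[:max_lines]:
        # Only a comment line can carry the declaration.
        if not line.lstrip().startswith(b"#"):
            continue
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None


print(find_coding_comment(b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\n"))
```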

- In addition, to simplify usage on Windows, Python recognizes the
  UTF-8 file signature (e.g. as generated by Notepad). Any file
  starting with \xef\xbb\xbf is treated as being UTF-8; a coding
  comment different from "utf-8" in such a file is an error.
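
  The signature rule could be sketched as follows. The function name
  source_encoding, the regular expression, the choice of SyntaxError,
  and the case-insensitive comparison against "utf-8" are assumptions
  of this sketch, not part of the proposal.

```python
import codecs
import re

# Hypothetical pattern for the -*- coding: <name> -*- comment.
CODING_RE = re.compile(rb"-\*-\s*coding:\s*([-\w.]+)\s*-\*-")


def source_encoding(source):
    """Return the encoding implied by signature and comment, or None."""
    has_signature = source.startswith(codecs.BOM_UTF8)  # b"\xef\xbb\xbf"
    # Assumption: the comment is searched for in the first two lines only.
    head = b"\n".join(source.splitlines()[:2])
    match = CODING_RE.search(head)
    declared = match.group(1).decode("ascii").lower() if match else None
    if has_signature:
        # A coding comment that disagrees with the signature is an error.
        if declared is not None and declared != "utf-8":
            raise SyntaxError(
                "coding comment %r conflicts with UTF-8 signature" % declared
            )
        return "utf-8"
    return declared
```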

- Identifiers remain restricted to ASCII.

- Implementation strategy: I believe the proposed strategy (change the
  tokenizer) is overly complicated, and likely inefficient. Instead, I
  suggest that the encoding directive applies only to Unicode literals.
  It will still be formally an error if comments or string literals do
  not follow the declared encoding, but the Python parser won't detect
  this error. 

  For use in Unicode literals, the parser will continue to work as it
  does now, except that it applies the declared coding in compile.c.
  To do so, PyUnicode_DecodeRawUnicodeEscape and
  PyUnicode_DecodeUnicodeEscape will expect an additional flag
  indicating whether they operate on a char* or a Py_UNICODE*.

  The only problem with this approach is that encodings where " or '
  could be the second byte of a multi-byte character cannot be
  supported as a source encoding. Python supports no such encoding
  in the standard library at the moment, anyway, so this should not
  be a problem.
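
  To illustrate the kind of encoding that is excluded, here is a small
  sketch. It assumes a Python installation that provides an
  ISO-2022-JP codec under the name "iso2022_jp", which goes beyond the
  standard library discussed above; the helper quote_bytes_inside is
  hypothetical.

```python
def quote_bytes_inside(text, encoding):
    """Return the quote byte values that show up when encoding quote-free text."""
    assert '"' not in text and "'" not in text
    encoded = text.encode(encoding)
    # 0x22 is the byte value of " and 0x27 is the byte value of '.
    return sorted(set(encoded) & {0x22, 0x27})


# U+3001 IDEOGRAPHIC COMMA: its ISO-2022-JP form has 0x22 as the second
# byte of the two-byte character, so a byte-level scan for the closing
# quote of a literal would stop in the middle of the character.
print(quote_bytes_inside("\u3001", "iso2022_jp"))

# UTF-8 never uses bytes below 0x80 inside a multi-byte character.
print(quote_bytes_inside("\u3001", "utf-8"))
```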

- Backwards compatibility: I'm in favour of leaving almost everything
  as-is, i.e. if there is no declared encoding, it should be possible
  to put arbitrary bytes in string literals and comments; the proposed
  implementation strategy supports that. However, I think that Unicode
  literals which use the Latin-1 fallback should be deprecated, and that
  the implementation should raise a DeprecationWarning; anybody relying
  on that feature should declare that the encoding is Latin-1.
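
  A rough sketch of the suggested deprecation, written in Python rather
  than in compile.c. The helper decode_literal_body and the wording of
  the warning are assumptions for illustration only.

```python
import warnings


def decode_literal_body(raw, declared_encoding=None):
    """Decode the raw bytes of a Unicode literal (hypothetical helper)."""
    if declared_encoding is not None:
        return raw.decode(declared_encoding)
    if any(byte > 0x7F for byte in raw):
        # No declaration: keep the Latin-1 fallback, but warn about it.
        warnings.warn(
            "non-ASCII Unicode literal without a declared encoding; "
            "declare Latin-1 explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
    # For pure ASCII input this is identical to an ASCII decode.
    return raw.decode("latin-1")
```

  With a declared encoding, or with pure ASCII content, no warning is
  issued.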

- Changes to IDLE: When IDLE opens a file, it shall look for the UTF-8
  signature. If no UTF-8 signature is found, it shall look for the
  coding comment. If none is found, it shall apply the locale's
  encoding, which is determined as follows:
  - on Windows, it is "mbcs"
  - on Unix, it is the one returned by nl_langinfo(CODESET)
  On any other platform, it is the system default encoding.
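
  The lookup order for opening a file could be sketched like this. The
  function names, the regular expression, and the two-line search
  window are assumptions; this is not the code of the IDLEfork patch.

```python
import codecs
import locale
import re
import sys

# Hypothetical pattern for the -*- coding: <name> -*- comment.
CODING_RE = re.compile(rb"-\*-\s*coding:\s*([-\w.]+)\s*-\*-")


def locale_encoding():
    """Return the locale's encoding following the rules above."""
    if sys.platform == "win32":
        return "mbcs"
    if hasattr(locale, "nl_langinfo"):
        # Reflects the user's locale only if the application has called
        # locale.setlocale(locale.LC_CTYPE, "") beforehand.
        return locale.nl_langinfo(locale.CODESET)
    return sys.getdefaultencoding()


def detect_encoding(data):
    """Return (encoding, has_signature) for the raw bytes of a file."""
    # 1. The UTF-8 signature takes precedence.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", True
    # 2. Otherwise look for the coding comment (first two lines, assumed).
    head = b"\n".join(data.splitlines()[:2])
    match = CODING_RE.search(head)
    if match:
        return match.group(1).decode("ascii"), False
    # 3. Otherwise fall back to the locale's encoding.
    return locale_encoding(), False
```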

  When saving a file, IDLE shall preserve the UTF-8 signature if there
  was one. If not, and if there is a coding comment, that should be
  used to encode the file. If there is none, the locale's encoding
  should be used. If encoding fails (whether the coding was found in
  the comment or in the locale), the file shall be UTF-8 encoded, and
  a UTF-8 signature shall be added.
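
  The saving rules could be sketched as follows. The helper
  encode_for_save and its parameters are hypothetical; treating an
  unknown coding name like an encoding failure is an additional
  assumption of this sketch.

```python
import codecs


def encode_for_save(text, had_signature, comment_coding, locale_coding):
    """Return the bytes to write to disk for `text`."""
    if had_signature:
        # Preserve the UTF-8 signature the file was opened with.
        return codecs.BOM_UTF8 + text.encode("utf-8")
    # Prefer the coding comment, then the locale's encoding.
    coding = comment_coding or locale_coding
    try:
        return text.encode(coding)
    except (UnicodeError, LookupError):
        # Encoding failed (or, by assumption, the codec is unknown):
        # fall back to UTF-8 and add a signature.
        return codecs.BOM_UTF8 + text.encode("utf-8")
```

  For example, text containing a euro sign saved under a Latin-1 coding
  comment cannot be encoded, so the result would be UTF-8 with a
  signature.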

Regards,
Martin