PEP 263 comments
Martin v. Loewis
martin at v.loewis.de
Mon Feb 25 00:20:20 EST 2002
To make some progress on PEP 263, I suggest that some of the open issues
be resolved as follows:
- Comment syntax: I suggest using the form
-*- coding: <coding name> -*-
Emacs already recognizes this syntax, as does patch #508973
on IDLEfork. The other proposed syntaxes should be removed from the
PEP.
- In addition, to simplify usage on Windows, Python recognizes the
UTF-8 file signature (e.g. as generated by Notepad). Any file
starting with \xef\xbb\xbf is treated as being UTF-8; a coding
comment different from "utf-8" in such a file is an error.
- Identifiers: these remain restricted to ASCII.
- Implementation strategy: I believe the proposed strategy (change the
tokenizer) is overly complicated, and likely inefficient. Instead, I
suggest that the encoding directive applies only to Unicode literals.
It will still be formally an error if comments or string literals do
not follow the declared encoding, but the Python parser won't detect
this error.
For use in Unicode literals, the parser will continue to work as it
does now, except that it applies the declared coding in compile.c.
To do so, PyUnicode_DecodeRawUnicodeEscape and
PyUnicode_DecodeUnicodeEscape will expect an additional flag
indicating whether they operate on a char* or a Py_UNICODE*.
The only problem with this approach is that encodings where " or '
could be the second byte of a multi-byte character cannot be
supported as a source encoding. Python supports no such encoding
in the standard library at the moment, anyway, so this should not
be a problem.
- Backwards compatibility: I'm in favour of leaving almost everything
as-is, i.e. if there is no declared encoding, it should be possible
to put arbitrary bytes in string literals and comments; the proposed
implementation strategy supports that. However, I think that Unicode
literals which use the Latin-1 fallback should be deprecated, and that
the implementation should raise a DeprecationWarning: Anybody relying
on that feature should declare that the encoding is Latin-1.
- Changes to IDLE: When IDLE opens a file, it shall look for the UTF-8
signature. If no UTF-8 signature is found, it shall look for the
coding comment. If none is found, it shall apply the locale's
coding, which is determined as follows:
- on Windows, it is "mbcs"
- on Unix, it is the one returned by nl_langinfo(CODESET)
Otherwise, it is the system default encoding.
When saving a file, IDLE shall preserve the UTF-8 signature if there
was one. If not, and if there is a coding comment, that should be
used to encode the file. If there is none, the locale's encoding
should be used. If encoding fails (whether the coding was found in
the comment or in the locale), the file shall be UTF-8 encoded, and
a UTF-8 signature added.
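As an illustration only, the signature check, the coding comment lookup and the IDLE open/save rules above could be sketched roughly as follows. This is not the proposed C implementation or the IDLEfork patch. All function names are hypothetical, the restriction of the comment to the first two lines is an assumption, and locale.getpreferredencoding() merely stands in for the "mbcs" / nl_langinfo(CODESET) rule.

```python
import codecs
import locale
import re

UTF8_SIGNATURE = codecs.BOM_UTF8  # the bytes \xef\xbb\xbf
# Matches the "-*- coding: <coding name> -*-" form.
CODING_RE = re.compile(rb"-\*-\s*coding:\s*([-\w.]+)\s*-\*-")


def declared_encoding(data):
    """Return (encoding, has_signature); encoding is None if none is declared."""
    has_signature = data.startswith(UTF8_SIGNATURE)
    # Assumption: only the first two lines are searched for the comment.
    head = b"\n".join(data.splitlines()[:2])
    match = CODING_RE.search(head)
    name = match.group(1).decode("ascii") if match else None
    if has_signature:
        # A coding comment other than UTF-8 in a signed file is an error.
        if name is not None and codecs.lookup(name).name != "utf-8":
            raise SyntaxError("coding comment conflicts with UTF-8 signature")
        return "utf-8", True
    return name, False


def open_encoding(data, locale_name=None):
    """Encoding to use when opening: signature, then comment, then locale."""
    name, _ = declared_encoding(data)
    # Assumption: getpreferredencoding() approximates the locale's coding.
    return name or locale_name or locale.getpreferredencoding(False)


def encode_for_save(text, had_signature, locale_name=None):
    """Encode text for saving, following the save rules above."""
    if had_signature:
        return UTF8_SIGNATURE + text.encode("utf-8")
    # Non-ASCII characters become "?", which cannot form a coding comment.
    name, _ = declared_encoding(text.encode("ascii", "replace"))
    encoding = name or locale_name or locale.getpreferredencoding(False)
    try:
        return text.encode(encoding)
    except UnicodeError:
        # Encoding failed: fall back to UTF-8 and add the signature.
        return UTF8_SIGNATURE + text.encode("utf-8")
```

A caller would read the file as bytes, pass them to open_encoding() to pick a codec for decoding, remember the has_signature flag from declared_encoding(), and hand that flag back to encode_for_save() when writing.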
Regards,
Martin