PEP 263 comments
M.-A. Lemburg
mal at lemburg.com
Tue Feb 26 05:06:45 EST 2002
"Martin v. Loewis" wrote:
>
> To make some progress on PEP 263, I suggest that some of the open issues
> are resolved as follows:
Thanks for the comments. I've updated the PEP at SourceForge...
> - Comment syntax: I suggest to use the form
> -*- coding: <coding name> -*-
> Emacs already recognizes this syntax, as does patch #508973
> on IDLEfork. The other proposed syntaxes should be removed from the
> PEP.
+1
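For illustration, recognizing that comment form could be sketched as below. This is only a sketch: the helper name, the exact pattern, and the limit of two lines are assumptions made for the example, not something settled in this thread.

```python
import re

# Pattern for the Emacs-style declaration "-*- coding: <coding name> -*-".
# Assumption: a coding name consists of letters, digits, '-', '_' and '.'.
CODING_RE = re.compile(r"-\*-\s*coding:\s*([-\w.]+)\s*-\*-")


def find_coding_comment(source_lines, max_lines=2):
    """Return the declared coding name, or None if none is found.

    Assumptions: only the first `max_lines` lines are examined, and the
    declaration must appear inside a '#' comment line.
    """
    for line in source_lines[:max_lines]:
        stripped = line.lstrip()
        if not stripped.startswith("#"):
            continue  # not a comment line, so it cannot carry the declaration
        match = CODING_RE.search(stripped)
        if match:
            return match.group(1)
    return None
```

A file whose second line reads `# -*- coding: latin-1 -*-` would yield the name `latin-1`; a file with no such comment yields `None`.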
> - In addition, to simplify usage on Windows, Python recognizes the
> UTF-8 file signature (e.g. as generated by notepad). Any file
> starting with \xef\xbb\xbf is treated as being UTF-8; a coding
> comment different from "utf-8" in such a file is an error.
+1
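The signature rule could be checked roughly as follows. This is a sketch under assumptions: the function name is hypothetical, and treating aliases such as "utf8" as equal to "utf-8" (via `codecs.lookup`) is a choice made for the example, not part of the proposal.

```python
import codecs


def check_utf8_signature(data, declared_coding=None):
    """Return (encoding, body) for the raw bytes of a source file.

    If the file starts with the UTF-8 signature (\\xef\\xbb\\xbf), it is
    treated as UTF-8 and the signature is stripped from the body.
    A conflicting coding comment is reported as an error.
    """
    if data.startswith(codecs.BOM_UTF8):
        if declared_coding is not None:
            # Assumption: aliases of UTF-8 are accepted; codecs.lookup
            # raises LookupError for a coding name Python does not know.
            if codecs.lookup(declared_coding).name != "utf-8":
                raise SyntaxError(
                    "coding comment %r conflicts with UTF-8 signature"
                    % declared_coding
                )
        return "utf-8", data[len(codecs.BOM_UTF8):]
    # No signature: the declared coding (possibly None) stands as-is.
    return declared_coding, data
```

Raising `SyntaxError` here is also an assumption; the text above only says that the mismatch "is an error".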
> - identifiers remain restricted to ASCII
+1
> - Implementation strategy: I believe the proposed strategy (change the
> tokenizer) is overly complicated, and likely inefficient. Instead, I
> suggest that the encoding directive applies only to Unicode literals.
> It will still be formally an error if comments or string literals do
> not follow the declared encoding, but the Python parser won't detect
> this error.
>
> For use in Unicode literals, the parser will continue to work as it
> does now, except that it applies the declared coding in compile.c.
> To do so, PyUnicode_DecodeRawUnicodeEscape and
> PyUnicode_DecodeUnicodeEscape will expect an additional flag
> indicating whether they operate on a char* or a Py_UNICODE*.
>
> The only problem with this approach is that encodings where " or '
> could be the second byte of a multi-byte character cannot be
> supported as a source encoding. Python supports no such encoding
> in the standard library at the moment, anyway, so this should not
> be a problem.
I've added a two-phase approach to the PEP: first we only
handle Unicode literals, then we do the whole file in a later
step.
> - Backwards compatibility: I'm in favour of leaving mostly everything
> as-is, i.e. if there is no declared encoding, it should be possible
> to put arbitrary bytes in string literals and comments; the proposed
> implementation strategy supports that. However, I think that Unicode
> literals which use the Latin-1 fallback should be deprecated, and that
> the implementation should raise a DeprecationWarning: Anybody relying
> on that feature should declare that the encoding is Latin-1.
Python will have to use Latin-1 as fallback encoding anyway,
so I don't think it's worth the trouble...
> - Changes to IDLE: When IDLE opens a file, it shall look for the UTF-8
> signature. If no UTF-8 signature is found, it shall look for the
> coding comment. If none is found, it shall apply the locale's
> coding, which is determined as follows:
> - on Windows, it is "mbcs"
> - on Unix, it is the one returned by nl_langinfo(CODESET)
> Otherwise, it is the system default encoding.
>
> When saving a file, IDLE shall preserve the UTF-8 signature if there
> was one. If not, and if there is a coding comment, that should be
> used to encode the file. If there is none, the locale's encoding
> should be used. If encoding fails (whether the coding was found in
> the comment or in the locale), the file shall be UTF-8 encoded, and
> a UTF-8 signature added.
I did not add the IDLE changes to the PEP. Please upload them
as a feature request to SF.
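For whoever files that request, the quoted open/save rules might be sketched like this. Both function names are hypothetical, and `locale.getpreferredencoding(False)` is only a stand-in for the "mbcs" / nl_langinfo(CODESET) lookup described in the quoted text; real IDLE code may well differ.

```python
import codecs
import locale


def choose_open_encoding(data, coding_comment=None):
    """Pick the encoding for reading a file, following the quoted order:
    UTF-8 signature first, then the coding comment, then the locale."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if coding_comment:
        return coding_comment
    # Assumption: stand-in for the platform-specific locale lookup.
    return locale.getpreferredencoding(False)


def encode_for_save(text, had_signature, coding_comment=None,
                    locale_encoding=None):
    """Encode text for saving, following the quoted rules."""
    if had_signature:
        # Preserve the UTF-8 signature if the file had one.
        return codecs.BOM_UTF8 + text.encode("utf-8")
    encoding = (coding_comment or locale_encoding
                or locale.getpreferredencoding(False))
    try:
        return text.encode(encoding)
    except UnicodeError:
        # Encoding failed: fall back to UTF-8 and add a signature.
        # An unknown coding name raises LookupError and is not handled here.
        return codecs.BOM_UTF8 + text.encode("utf-8")
```

The `locale_encoding` parameter exists only so the locale step can be overridden when trying the sketch out; it is not part of the quoted proposal.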
Thanks,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/