[Python-Dev] PEP 263 - default encoding

Stephen J. Turnbull stephen@xemacs.org
18 Mar 2002 19:09:14 +0900


>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:

    Martin> "Stephen J. Turnbull" <stephen@xemacs.org> writes:

    >> The parser accepts programs encoded in unicode.

    Martin> I still can't see how this is different from what the PEP
    Martin> says.

The PEP says "This PEP proposes to introduce a syntax to declare the
encoding of a Python source file. The encoding information is then
used by the Python parser to interpret the file using the given
encoding." and "I propose to make the Python source code encoding both
visible and changeable on a per-source file basis".  That strongly
suggests to me that it's Python's job to list, define, and implement
the acceptable codings.

It claims to "provide ... a more robust and portable definition."  Of
what, it does not explicitly say; I interpret it to mean the definition
of legal encodings of Python source code.  I doubt I'll be the only
one.  And I think that's really what you have in mind, anyway.  Your
comment about "who cares if the sun doesn't set" certainly suggests
that.

    Martin> With the PEP, people can write source code in different
    Martin> encodings, but any problems they get are their problems.

Where does it say that?  The current language in the PEP suggests
quite the opposite to me.  Basically this PEP is designed to
facilitate non-portable, non-interoperable programming styles.  I see
the need, but I think it's regrettable.

As written, the PEP never explicitly says "we won't support most of
the infinite variety of ways to hurt yourself that this facility
provides."  I think users will start by expecting it to support the
ones they're addicted to, then complain when it fails.  That's
certainly the experience with Emacs.

    Martin> It is traditional Python policy not to take side on
    Martin> political debates. If this sun does not set, what is the
    Martin> problem?

Nothing, if you don't see barriers to interoperability and reuse of
code as a problem.

    >> o I think it makes it hard to implement helper tools (eg
    >> python-mode).

    Martin> Harder than with those hooks?

Yes.  Because ordinary string literals must be handled specially.  As
I pointed out, a good Emacs implementation will ignore the coding
cookies on Emacs input; python-mode will have to lex the buffer
itself.  (Or undo the transformation for literal strings, assuming it
can.)

    >> And how is he going to use regexps or formatting sugar without
    >> literal UTF-16 strings?

    Martin> In stage 1 of the implementation, he can use either UTF-8
    Martin> or EUC-JP in Unicode literals.

Assuming he's willing to use Unicode literals.  Maybe for good or bad
reasons he really wants ordinary strings.

    Martin> I'm not sure I can follow this example. If you put byte
    Martin> 185 into a Python source code file, and you declare the
    Martin> file as Latin-2, what does that have to do with the
    Martin> locale? PEP 263 never mentions use of the locale for
    Martin> anything.

I apologize for the reference to locale; that was incorrect.  I meant
there's a good chance the file will have a Latin-2 cookie.
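
To make the byte-185 case concrete, here is a minimal sketch in
present-day Python 3, which postdates this thread.  The helper name
`load_literal` is hypothetical and is not taken from the PEP.  It
compiles the same source bytes under two different coding cookies, so
the only thing that changes is the declared encoding:

```python
def load_literal(cookie: str) -> str:
    """Compile a tiny module whose string literal holds raw byte 0xB9 (185)."""
    source = (
        b"# -*- coding: " + cookie.encode("ascii") + b" -*-\n"
        b"s = '\xb9'\n"  # the literal contains the raw byte, not an escape
    )
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]


latin2 = load_literal("iso-8859-2")  # byte 185 read as Latin-2
latin1 = load_literal("iso-8859-1")  # same byte read as Latin-1
print(hex(ord(latin2)), hex(ord(latin1)))
```

Under the Latin-2 cookie the byte becomes U+0161 (s with caron); under
a Latin-1 cookie it becomes U+00B9 (superscript one).  Nothing in the
bytes themselves says which reading was intended.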

    >> This can be made safe by not decoding the contents of ordinary
    >> string literals, but that requires that the parser has to do
    >> the lexing, you can't delegate it to a general-purpose codec.

    Martin> Why is that? If the declared encoding of the file is
    Martin> Latin-2, the parser will convert it into Unicode, then
    Martin> parse it, then reconvert byte strings into Latin-2.

This _probably_ works.  However, in the text quoted above, I wrote
"by not decoding the contents of ordinary string literals", and that
cannot be done by a general-purpose codec.

IMHO, the parser should never need to call a codec.  For text, we can
generally rely on codecs to provide encoders and decoders that are
inverses; not so for binary.  This is just not safe, as you admit.
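
A rough way to see the difference between text and binary here is to
check whether decode followed by encode returns the original bytes.
The sketch below uses present-day Python 3 and a hypothetical helper
`roundtrips`; it only illustrates the inverse property and is not a
description of how any parser works:

```python
def roundtrips(data: bytes, encoding: str) -> bool:
    """Return True if decoding then re-encoding gives back the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # The bytes are not valid in this encoding at all.
        return False


all_bytes = bytes(range(256))  # stand-in for arbitrary binary data

print(roundtrips(all_bytes, "iso8859_2"))  # single-byte charset
print(roundtrips(all_bytes, "utf-8"))      # multibyte encoding
print(roundtrips(b"\xb9", "euc_jp"))       # lone lead byte
```

Latin-2 happens to assign a character to every byte value, so the
round trip survives there.  With UTF-8 or EUC-JP, arbitrary binary
data is usually not even decodable, so a literal holding such data
cannot pass through the codec unchanged.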

    Martin> Breakage won't be silent, though. People will get a
    Martin> warning in phase 1, so they will know to declare an
    Martin> encoding.

Which they will see on the majority of their files, almost all of
which will work despite the warning.  People who hate warnings will
turn them off by automatically adding the cookie to all programs.
Others will ignore them, and maybe remember them when they hit a bug.

    Martin> That is indeed a problem - those byte strings would have
    Martin> different values at run-time. I expect that most users
    Martin> will accept the problem, since the strings still have
    Martin> their original "meaning".

If they are using ordinary strings correctly (i.e., not for containing
text), this is out-and-out data corruption.  True, they should be
using octal or hex escapes.  But I bet there's lots of code out there
that doesn't; I know there's tons in Emacs Lisp.
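
As a small illustration of why escapes are the safe spelling, the
sketch below (present-day Python 3; the variable names are
illustrative only) compares the bytes a raw character turns into when
the file holding it is saved in two different encodings with a value
that is pinned by a hex escape:

```python
char = "\u0161"  # the character a Latin-2 file stores as byte 185

# Bytes that end up in the file, depending on how the file is saved.
as_latin2 = char.encode("iso8859_2")
as_utf8 = char.encode("utf-8")

# A hex escape is plain ASCII in the source, so recoding the file
# cannot change the value it denotes.
escaped = b"\xb9"

print(as_latin2 == escaped)  # same single byte as the escape
print(as_utf8 == escaped)    # recoded file yields different bytes
print(list(as_utf8))
```

If the string was meant as binary data, only the escaped form keeps
its value when the source file is converted from one encoding to
another.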


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.