PEP 263 comments

Fri Mar 1 01:39:42 EST 2002

Please note, you are correct---I missed the main point.  I'm not going
to address that point-by-point, unless you ask.  However, it's easy
enough to paraphrase my comments to apply to "are we going to mandate
UTF-8 for all source code?".  The following is a rough counter-
proposal to handle the main issues, framed correctly.

>>>>> "Martin" == Martin von Loewis <loewis at informatik.hu-berlin.de> writes:

    Martin> [U]nder the proposed change, Python would refuse to accept
    Martin> source code if it is not UTF-8 encoded. In turn, code that
    Martin> has a euc-jp comment in it and is now happily accepted as
    Martin> source code in the current Python programming language
    Martin> would be rejected.

Fine with me for Elisp.  We don't have satisfactory UTF-8 support yet,
but will soon.  After that there will be no excuses for us, since the
editor is the interpreter.

This _is_ the direction we're heading.  Emacs treats Lisp files the
same as any other on initial loading, so we already have hooks in
place that could be used to translate.  The point is that if XEmacs
didn't accept the encoding of a Lisp file, it would be on the user-
provided codec to get it right.  It's not our problem (except to the
extent that we would of course provide such codecs).  We're currently
working to make codecs available to users in a convenient, consistent
way--and Python has the advantage that that part is done!

Such a hook probably would be something new in Python, but I don't see
that it would be terribly difficult to implement.

    Martin> Will you reject a source module just because it contains a
    Martin> latin-1 comment?

    >> That depends.  Somebody is going to run it through the
    >> converter; it's just a question of whether it's me, or the
    >> submitter.

    Martin> 'you' in this case isn't the maintainer of a software
    Martin> package; it is the Python source code parser...

Same thing either way, if you add the preprocessing hook.

IMO, the Python source code parser should never see any text data[1]
that is not UTF-8 encoded.  If you want to submit Python programs to
the parser that are not UTF-8 encoded, then it is your responsibility
as the programmer to make sure they get translated into UTF-8 (e.g., by
the preprocessing hook) before the interpreter proper ever sees them.

Note that the Python language doesn't need to specify at all what's
allowed on the preprocessing hook.  It can be a Perl-to-Python
translator, for all I care.  You simply say "if the parser doesn't
accept it when run `python --skip-preprocessing-hook', it's not valid
Python."  No more problem.  From the user's point of view, in everyday
operation it's basically the same as a Python which accepts his
favorite encoding.  From your point of view, the interpreter is
invulnerable to coding issues.  Even if you choose to support complex
(e.g., autorecognizing[2]) codecs on the preprocessing hook as part of
the Python library, bugs are more easily localized.
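
To make the division of labor concrete, here is a minimal sketch in
present-day Python 3.  The names preprocess_hook and strict_parse are
hypothetical, made up for illustration; nothing like them exists in
Python, and a real hook could be arbitrarily more elaborate.  The only
point is that the parser side rejects anything that is not valid
UTF-8, while all codec knowledge lives on the hook side.

```python
def preprocess_hook(raw: bytes, source_encoding: str) -> bytes:
    """Hypothetical hook: transcode source bytes to UTF-8."""
    # The codec name comes from the user; an unknown name raises LookupError.
    return raw.decode(source_encoding).encode("utf-8")


def strict_parse(data: bytes, filename: str = "<source>"):
    """Stand-in for the parser proper: accepts UTF-8 and nothing else."""
    # Non-UTF-8 input raises UnicodeDecodeError before any parsing happens.
    text = data.decode("utf-8")
    return compile(text, filename, "exec")


# A module with a EUC-JP comment, as in Martin's example.
euc_source = "# こんにちは\nx = 1\n".encode("euc_jp")

# Without the hook, the strict parser refuses the file.
try:
    strict_parse(euc_source)
    accepted_raw = True
except UnicodeDecodeError:
    accepted_raw = False

# With the hook in front, the same file is ordinary Python.
namespace = {}
exec(strict_parse(preprocess_hook(euc_source, "euc_jp")), namespace)
```

Calling strict_parse directly plays the role of running with
`--skip-preprocessing-hook': whatever it accepts is valid Python, and
whatever it rejects is the hook's business.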

Since the preprocessing hook would be callable from Python, it would
be easy to run it as a separate program, and require the users to send
the output of that as the bug report.
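
Roughly, the standalone version could be as small as the following
sketch (Python 3 again).  The function name run_hook_as_filter is an
assumption of mine, and the in-memory streams merely stand in for the
user's file and the attachment to the bug report.

```python
import io


def run_hook_as_filter(infile, outfile, source_encoding: str) -> None:
    """Hypothetical standalone filter: any known encoding in, UTF-8 out."""
    raw = infile.read()
    outfile.write(raw.decode(source_encoding).encode("utf-8"))


# Simulate a user preparing a bug report from a Latin-1 source file.
user_file = io.BytesIO("s = 'Grüße'\n".encode("latin-1"))
report = io.BytesIO()
run_hook_as_filter(user_file, report, "latin-1")
```

Wiring the same function to sys.stdin.buffer and sys.stdout.buffer
would turn it into a command-line filter.  Either way, what the
maintainer receives is exactly what the parser would have seen.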

The final benefit is that in multilingual environments it makes using
UTF-8 a lot more attractive to the users.  And those are exactly the
environments where coding cookies will be a massive pain for Python to
support, because people will forever be copying the top matter from
German files into Polish files and forgetting to adjust the cookie,
etc.
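
The nasty part of a stale cookie is that nothing fails.  A small
illustration in Python 3, assuming the German file was ISO-8859-1 and
the Polish one ISO-8859-2; decode_with_cookie is a hypothetical helper
standing in for whatever reads the cookie and trusts it.

```python
def decode_with_cookie(raw: bytes, cookie: str) -> str:
    """Decode source bytes the way a cookie-trusting reader would."""
    return raw.decode(cookie)


# A Polish comment, saved as ISO-8859-2.
polish_bytes = "# żółw\n".encode("iso8859_2")

# Correct cookie: the text round-trips.
right = decode_with_cookie(polish_bytes, "iso8859_2")

# Cookie copied from a German file: no exception, just different text.
wrong = decode_with_cookie(polish_bytes, "iso8859_1")
```

Every octet is a legal character in both encodings, so the mistake
produces no error at all, only the wrong letters.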

Footnotes: 
[1]  I.e., Python language or character text.  It might be convenient to
have an octet-string primitive data type, in which you could put
EUC-encoded Japanese or Java byte codes.  However, the Python
interpreter would never do anything with them except (1) pass them
whole as arguments or variable values, (2) extract slices, and (3)
extract individual octets as an integral (but non-character) type.
(Roughly speaking.  There might be other operations that "base Python"
should implement, like applying codecs.)
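
As a rough illustration of the three operations, the bytes type in
Python 3 (which postdates this message, and is used here only as a
stand-in for the octet-string type described above) behaves much like
this:

```python
# EUC-encoded Japanese held as an opaque octet string.
octets = "日本語".encode("euc_jp")


def pass_through(data: bytes) -> bytes:
    """(1) Pass the value whole, without interpreting it."""
    return data


whole = pass_through(octets)

# (2) A slice is again an octet string.
first_two = octets[0:2]

# (3) A single element is an integer, not a character.
first_octet = octets[0]

# Applying a codec is the one step that turns octets into text.
text = octets.decode("euc_jp")
```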

[2]  But I recommend against this.  Don't offer support for such; it's
a time and effort sink, for little return.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.