PEP 263 comments

Martin von Loewis loewis at informatik.hu-berlin.de
Thu Feb 28 09:09:23 EST 2002


"Stephen J. Turnbull" <stephen at xemacs.org> writes:

> Hi, I'm Steve Turnbull, I do XEmacs.  Mostly Mule.  Barry asked me to
> step up to bat on this.

Thanks for your comments!

> You don't.  From now on, anything that goes into the official Python
> sources is in UTF-8.  Convert any existing stuff at your leisure.
> This is recommended practice for 3rd party projects, too.  People can
> do what they want with their own stuff, but they are on notice that if
> it screws up it's their problem.

It's worse than this: under the proposed change, Python would refuse
to accept source code that is not UTF-8 encoded. As a result, code
that contains a euc-jp comment and is happily accepted by the current
Python interpreter would be rejected.

This is like mandating that all Emacs-Lisp files be UTF-8, whether
they are part of the Emacs sources or installed somewhere out in the
wild.

> XEmacs actually did this (half-way) three years ago.  I convinced
> Steve Baur to convert everything in the XEmacs CVS repository that
> wasn't ISO 8859/1 to ISO-2022-JP (basically, start in ASCII, all other
> character sets must designate to G0 or G1, and get invoked to GL; at
> newlines, return to ASCII by designation; the "JP" part is really a
> misnomer, it's fully multilingual).  Presto! no more accidental Mule
> corruption in the repository.

This is a different issue: we are not discussing the encoding that the
Python sources use in the Python CVS tree; we are discussing the
encoding that Python source code in general uses.

> Oh, and does Python have message catalogs and stuff like that?  Do you
> really want people doing multilingual work like translation mucking
> about with random coding systems and error-prone coding cookies?
> UTF-8 detection is much easier than detecting that an iso-8859-1
> cookie should really be iso-8859-15 (a reverse Turing test).

Python supports gettext, but that is still a different issue. Python's
Unicode type is precisely that - a single Unicode type; it is not as
if Python supported different wide-character implementations
internally. Again, the issue is how source code is encoded.

> So much for the alleged "backward compatibility" non-issue.  :-)
> People are abusing implementation dependencies; Just Say No.

A very radical opinion :-) but I get the feeling you might be missing
the point in question ...

>     Martin> Will you reject a source module just because it contains a
>     Martin> latin-1 comment?
> 
> That depends.  Somebody is going to run it through the converter; it's
> just a question of whether it's me, or the submitter.  

The 'you' in this case isn't the maintainer of a software package; it
is the Python source code parser...

> GNU Emacs supports your coding system cookies.  XEmacs currently
> doesn't, but we will, I already figured out what the change is and
> told Barry OK.  And I plan to add cookie-checking to my latin-unity
> package (which undoes the Mule screwage that says Latin-1 NO-BREAK
> SPACE != Latin-2 NO-BREAK SPACE).  Other editors can do something
> similar.

I assume you are talking about the -*- coding: foo -*- stuff here?
*This* is the issue in question: should we allow it, or should we
mandate that all Python source code (not just the code in the Python
CVS) be UTF-8?
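
(For concreteness: the cookie is the Emacs-style declaration near the
top of a source file - on line 1 or 2 - along the lines of the example
below; the encoding name shown is just an illustration, any supported
encoding could appear there.)

   #!/usr/bin/env python
   # -*- coding: iso-8859-1 -*-
   # With the cookie, Latin-1 text in comments and in u"..." literals
   # further down is unambiguous; without it, the parser would have to
   # assume some default encoding.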

> So people who insist on using a national coded character set in their
> editor use cookies.  Then the python-dev crew prepares a couple of
> trivial scripts which munge sources from UTF-8 to national codeset +
> cookie, and back (note you have to strip the cookie on the way back),
> for the sake of people whose editor's Python-mode doesn't grok cookies.

Again, not the issue: Most people run Python programs without ever
submitting them to python-dev :-)
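
(Granted, such a converter would be trivial to write. A rough, purely
illustrative sketch of the "back" direction - national codeset plus
cookie in, UTF-8 without a cookie out - might look like this:)

   import re

   COOKIE = re.compile(r"coding[:=]\s*([-\w.]+)")

   def cookie_to_utf8(in_path, out_path):
       data = open(in_path, "rb").read()
       lines = data.split(b"\n")
       encoding = "ascii"                   # default if no cookie is found
       for i in range(min(2, len(lines))):  # cookie must be on line 1 or 2
           m = COOKIE.search(lines[i].decode("ascii", "replace"))
           if m:
               encoding = m.group(1)
               del lines[i]                 # strip the now-redundant cookie
               break
       text = b"\n".join(lines).decode(encoding)
       open(out_path, "wb").write(text.encode("utf-8"))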

You may wonder why Python (the programming language) needs to worry
about the encoding at all. The reason is that we allow Unicode
literals, in the form

   u"text"

The question is what the encoding of "text" is on disk. In memory it
will be 2-byte Unicode, so the interpreter needs to convert; to do
that, it must know the encoding used on disk. The choices are either
mandating UTF-8, or allowing encoding cookies.
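
To make that concrete (a small illustration only, not from the PEP):
the same literal stored under two different encodings produces
different bytes on disk, so the interpreter cannot reconstruct the
intended characters without knowing which encoding the file uses.

   literal = u"Gr\u00fc\u00dfe"                 # the intended in-memory value

   on_disk_latin1 = literal.encode("latin-1")   # 'Gr\xfc\xdfe' (5 bytes)
   on_disk_utf8 = literal.encode("utf-8")       # 'Gr\xc3\xbc\xc3\x9fe' (7 bytes)

   # Both byte sequences decode back to the same literal, but only if the
   # interpreter applies the matching codec; the bytes themselves differ.
   assert on_disk_latin1.decode("latin-1") == literal
   assert on_disk_utf8.decode("utf-8") == literal
   assert on_disk_latin1 != on_disk_utf8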

> My apologies for the flood.  I've been thinking about exactly this
> kind of transition for XEmacs for about 5 years now, this compresses
> all of that into a few dozen lines....

I'm not sure whether a similar issue exists in XEmacs: the encoding of
ELisp source files would be the closest analogue, but only if the Lisp
interpreter itself ever needs to worry about it.

Regards,
Martin



