PEP 263 comments

Stephen J. Turnbull stephen at xemacs.org
Thu Feb 28 06:39:06 EST 2002


Hi, I'm Steve Turnbull, I do XEmacs.  Mostly Mule.  Barry asked me to
step up to bat on this.

Background: I've been doing Japanese Emacs ("nemacs") and Mule for 12
years now (it's where I live...).  I've been watching Japanese open
source both as a wannabe hacker and a social scientist (that's my day
job) for about 10 years.

>>>>> "Martin" == Martin v Loewis <martin at v.loewis.de> writes:

    Martin> "Jason Orendorff" <jason at jorendorff.com> writes:

    >> Counter-proposal:
    >> - Comment syntax: none.
    >> - UTF-8 file signature: not supported.
    >> - Python source code encoding: must always be UTF-8.
    >> - Implementation: within the parser, everything's just
    >> ordinary UTF-8 bytes.
    >> - IDLE: always save UTF-8 unless otherwise directed.

    Martin> Do you seriously want to pursue this route?

Yes, I think you do.  I've watched Japanese patches to apps languish,
never to be merged, for 2, 5, 10 years.  I'm sure some go back farther
than that.  All because the Japanese want to use their favorite
encodings internally and in sources, none of which (except
ISO-2022-JP) bother to announce that they are Japanese in any way.

You're really not doing anyone a favor by supporting "my favorite
encoding" anymore.  Make EUC and ISO 2022 users, Shift JIS and Big 5
abusers, Windows 125x <l-word deleted> check their weapons at the
door: as soon as you get inside Python, it's all Unicode.

    Martin> If so, how do you want to deal with backwards
    Martin> compatibility?

You don't.  From now on, anything that goes into the official Python
sources is in UTF-8.  Convert any existing stuff at your leisure.
This is recommended practice for 3rd party projects, too.  People can
do what they want with their own stuff, but they are on notice that if
it screws up it's their problem.

XEmacs actually did this (half-way) three years ago.  I convinced
Steve Baur to convert everything in the XEmacs CVS repository that
wasn't ISO 8859/1 to ISO-2022-JP (basically, start in ASCII, all other
character sets must designate to G0 or G1, and get invoked to GL; at
newlines, return to ASCII by designation; the "JP" part is really a
misnomer, it's fully multilingual).  Presto! no more accidental Mule
corruption in the repository.

NB: UTF-8 is much more tractable than ISO-2022-JP, precisely because
of the ASCII 0x22 issue.  We've had problems with that (since we still
support non-Mule XEmacsen that don't understand ISO 2022 controls).
UTF-8 makes this a non-issue.
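To make the 0x22 point concrete: a tool that doesn't track ISO 2022 shift
state sees a bare ASCII double-quote inside the encoded form of an ordinary
Japanese character, while UTF-8 keeps all its non-ASCII machinery in bytes
above 0x7F.  A small demonstration in today's Python (my example character,
not from the original discussion):

```python
# The ideographic comma (U+3001) encodes in ISO-2022-JP as
#   ESC $ B  0x21 0x22  ESC ( B
# and that 0x22 byte is exactly ASCII '"'.  A lexer that tokenizes
# the raw bytes without understanding ISO 2022 controls sees a
# stray string delimiter in the middle of the text.
ideographic_comma = "\u3001"
jis = ideographic_comma.encode("iso-2022-jp")
utf8 = ideographic_comma.encode("utf-8")

assert b'"' in jis        # looks like a quote character to naive tools
assert b'"' not in utf8   # UTF-8 multibyte sequences use bytes >= 0x80
```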

We left a lot of stuff that was ISO 8859/1 as is.  But this is a bad
idea post-Euro, and causes the occasional embarrassment as we pick up
more Latin-N, N != 1, users.  The Euro by itself wouldn't be a
problem, nobody uses the generic currency symbol except as a bullet in
lists.  But (and again this is based on my Japanese experience) it's
the other characters in Latin-9 that are the most important characters
in the world---to those who use them: they're part of their names.
But people do occasionally use the accents in composing characters and
libraries, so deprecating Latin-1 in favor of Latin-9 probably is
going to annoy a few people who have always followed the rules.  So
make a clean sweep, now.  In two years, you'll have no regrets.

Oh, and does Python have message catalogs and stuff like that?  Do you
really want people doing multilingual work like translation mucking
about with random coding systems and error-prone coding cookies?
UTF-8 detection is much easier than detecting that an iso-8859-1
cookie should really be iso-8859-15 (a reverse Turing test).
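Detecting UTF-8 really is that cheap, because a strict decode fails on
almost any non-trivial legacy 8-bit text.  A minimal sketch (the function
name is mine):

```python
def looks_like_utf8(data):
    """Return True if the byte string is well-formed UTF-8.

    Non-trivial text in a legacy 8-bit encoding (e.g. Latin-1 with
    accented letters) almost always contains a byte sequence that is
    ill-formed UTF-8, so a strict decode is a practical detector.
    The weak spot is nearly-pure-ASCII input, which decodes cleanly
    under almost any encoding anyway.
    """
    try:
        data.decode("utf-8")  # default error handler is strict
        return True
    except UnicodeDecodeError:
        return False
```

Contrast that with telling iso-8859-1 apart from iso-8859-15: every byte
string is valid in both, so no decoder can reject either; only a human (or
a language model of the text) can say which was meant.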

    Martin> Currently, you can put arbitrary bytes in character
    Martin> strings, and people make use of this opportunity (even

I have no sympathy for self-inflicted injuries anymore.  The amount of
effort that has gone into maintaining Japanese patches for Pine and
ghostscript has been extremely painful to watch.  Anyway, Python
itself provides the necessary tools for salvation.

    Martin> though the documentation says this is undefined).

So much for the alleged "backward compatibility" non-issue.  :-)
People are abusing implementation dependencies; Just Say No.

    Martin> Will you reject a source module just because it contains a
    Martin> latin-1 comment?

That depends.  Somebody is going to run it through the converter; it's
just a question of whether it's me, or the submitter.  In the case of
XEmacs, because everybody uses Emacs to develop, it's just not an
issue: somebody commits the change from (eg) EUC-JP to ISO-2022-JP,
and after that Mule does its thing---nobody even notices, unless they
do a diff.  Even at the time we got very few complaints about spurious
diffs.  Now, never.

This is true for Python, too.  I don't care if people want to do their
editing in ISCII or KOI8-R or Windows-1252 even.  _They have the tool
needed to convert, by definition: they're Python users._  Here's where
cookies come in.

GNU Emacs supports your coding system cookies.  XEmacs currently
doesn't, but we will, I already figured out what the change is and
told Barry OK.  And I plan to add cookie-checking to my latin-unity
package (which undoes the Mule screwage that says Latin-1 NO-BREAK
SPACE != Latin-2 NO-BREAK SPACE).  Other editors can do something
similar.

So people who insist on using a national coded character set in their
editor use cookies.  Then the python-dev crew prepares a couple of
trivial scripts which munge sources from UTF-8 to national codeset +
cookie, and back (note you have to strip the cookie on the way back),
for the sake of people whose editor's Python-mode doesn't grok cookies.
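A sketch of such a pair of converters in today's Python.  The cookie
pattern follows the PEP's `coding[:=]` magic-comment convention; the
function names and the exact cookie line are my own choices, not anything
the PEP mandates:

```python
import re

# PEP 263 magic-comment pattern; the cookie must appear on line 1 or 2.
COOKIE_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")

def to_national(utf8_bytes, encoding):
    """Re-encode UTF-8 source into a legacy encoding, prepending a cookie."""
    text = utf8_bytes.decode("utf-8")
    cookie = "# -*- coding: %s -*-\n" % encoding
    return (cookie + text).encode(encoding)

def to_utf8(national_bytes):
    """Find the declared encoding, strip the cookie, re-encode as UTF-8."""
    # Only the first two lines may carry the cookie.
    head = national_bytes.split(b"\n", 2)[:2]
    encoding = cookie_line = None
    for i, line in enumerate(head):
        m = COOKIE_RE.match(line.decode("ascii", "replace"))
        if m:
            encoding, cookie_line = m.group(1), i
            break
    if encoding is None:
        raise ValueError("no coding cookie found")
    lines = national_bytes.split(b"\n")
    del lines[cookie_line]  # strip the cookie on the way back
    return b"\n".join(lines).decode(encoding).encode("utf-8")
```

Round-tripping a source file through `to_national(..., "euc-jp")` and back
through `to_utf8` reproduces the original bytes, which is exactly the
property the repository needs.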

I expect that with that kind of support, what is left is just enough
pain to induce lots of people to switch to UTF-8-capable editors, and
just little enough that you can say "well, this really is for
everybody's benefit; we know it's inconvenient in the transition, and
we're doing the best we can to ease it" to the rest, and not be lynched.

This disposes of the "not everyone has a UTF-8 editor" issue.  Also,
in my experience distinguishing UTF-8 from "all other coding systems"
is hardly error-prone at all, except for files with extremely low
non-ASCII content, like under 50 bytes of non-ASCII.  Ben Wing has
already implemented statistical detection (ie, returns a degree of
likelihood, and could -- not implemented yet -- look at statistical
properties of the text) for XEmacs 22.  I imagine I could persuade him
to donate code to Python.
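The degree-of-likelihood idea can be sketched crudely: score the fraction
of non-ASCII bytes that sit inside well-formed UTF-8 multibyte sequences.
This is my illustration of the approach, not Ben Wing's XEmacs detector:

```python
def utf8_likelihood(data):
    """Crude scorer in [0, 1]: near 1.0 suggests UTF-8, near 0.0
    suggests a legacy 8-bit encoding.  Pure ASCII scores 1.0, which
    reflects the genuinely ambiguous under-50-bytes-of-non-ASCII case."""
    non_ascii = good = 0
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            i += 1
            continue
        # Lead byte determines the expected sequence length.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0  # stray continuation or invalid lead byte
        seq = data[i:i + length]
        if length and len(seq) == length and all(
                0x80 <= c <= 0xBF for c in seq[1:]):
            non_ascii += length
            good += length
            i += length
        else:
            non_ascii += 1
            i += 1
    if not non_ascii:
        return 1.0  # pure ASCII is trivially valid UTF-8
    return good / non_ascii
```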

My apologies for the flood.  I've been thinking about exactly this
kind of transition for XEmacs for about 5 years now, this compresses
all of that into a few dozen lines....

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.


