[Python-Dev] forwarded message from Stephen J. Turnbull

Stephen J. Turnbull stephen@xemacs.org
04 Mar 2002 12:57:48 +0900


>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:

    Martin> I'm not sure whether he still has his original position
    Martin> "do not allow multiple source encodings to enter the
    Martin> language", which, to me, translates into "source encodings
    Martin> are always UTF-8".

Yes, it is still my position.  I feel that it is possible to support
users who want
to use national encodings AND define the language in terms of a single
coded character set, as long as that set is Unicode.  The usual
considerations of file system safety and standard C library
compatibility dictate that the transformation format be UTF-8.  (Below
I will just write "UTF-8" as is commonly done.)

My belief is that the proposal below has the same effect on most users
most of the time as PEP 263, while not committing Python to indefinite
support of a subsystem that will certainly be obsolete for new code in
5 years, and most likely within 2 (at least for people using open
source and major vendor tools; I don't know what legacy editors people
may be using on "big iron" and whatnot).

    Martin> If that is the route to take, PEP 263 should be REJECTED,
    Martin> in favour of only tightening the Python language
    Martin> definition to only allow UTF-8 source code.

I think so.

    Martin> For Python, it probably would have to go to the second
    Martin> line, with the rationale given in the Emacs manual: the
    Martin> first line is often used for #!.

Precisely.

I do not have time or the background to do a formal counter-PEP for
several weeks (likely late April), since I'd have to do a fair amount
of research into both Python internals and PEP procedure.  I'd be
happy to consult if someone who does know those wants to take it and
run with it.

Here are the bones:

1.  a.  Python source code is encoded in UTF-8, with the full Unicode
    character set available (ie, not just the BMP).  The parser proper
    will reject as corrupted anything that doesn't have the
    characteristic UTF-8 leading-byte/trailing-bytes signature (a
    sketch of such a check follows this list).

    b.  Some provision must be made for systematic handling of
    private-use characters.  Ie, it should be possible to register
    for, and be dynamically allocated, a block from private space.
    You also need to be able
    to request a specific block and move blocks, because many vendors
    (Apple and Microsoft spring immediately to mind) allow their apps
    to use fixed blocks in private space for vendor character sets.
    At this stage it suffices to simply advise that any fixed use of
    the private space is likely to conflict with future standards for
    sharing that space.

    c.  This proposal takes no stand on the use of non-ASCII in
    keywords and identifiers.
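
    For concreteness, here is a minimal sketch (in Python; names and
    details are hypothetical, not a spec) of the structural check
    described in 1a.  A real validator would also reject overlong
    forms and other irregularities:

        def looks_like_utf8(data):
            """Check the UTF-8 leading-byte/trailing-bytes structure
            of a byte string.  Anything that is not a well-formed
            sequence of UTF-8 units is rejected as corrupted."""
            i, n = 0, len(data)
            while i < n:
                lead = data[i]
                if lead < 0x80:                  # ASCII, one byte
                    i += 1
                    continue
                if 0xC0 <= lead < 0xE0:          # 2-byte sequence
                    trailing = 1
                elif 0xE0 <= lead < 0xF0:        # 3-byte sequence
                    trailing = 2
                elif 0xF0 <= lead < 0xF8:        # 4-byte, beyond the BMP
                    trailing = 3
                else:                            # stray trailing byte etc.
                    return False
                for j in range(i + 1, i + 1 + trailing):
                    if j >= n or not (0x80 <= data[j] < 0xC0):
                        return False
                i += 1 + trailing
            return True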

Accommodation of existing usage:

2.  Python is a scripting language already in widespread use with
    ambitions of longevity; provision must be made for quick hacks and
    legacy code.  This will be done via a preprocessing hook and
    (possibly) I/O hooks.

    The preprocessing hook is a filter which is run to transform the
    source code buffer on input.  It is the first thing done.  Python
    (the language) will never put anything on that hook; any code that
    requires a non-null hook to function is not "true" Python.  Thus
    there need be no specification for the hook[1]; anything the user
    puts on the hook is part of their environment.  The preprocessing
    hook can be disabled via a command line switch and possibly an
    environment variable (it might even make sense for the hook
    function to be named in an environment variable, in which case a
    null value would disable it).

    The intended use is for a codec to be run on the source buffer to
    convert it to UTF-8.
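
    A minimal sketch of the mechanism, assuming (purely for
    illustration) that the hook function is named in an environment
    variable PYTHONSOURCEFILTER as "module:function", with a null
    value disabling it:

        import importlib
        import os

        def run_source_hook(source_bytes):
            """Apply the user's preprocessing hook, if any, to the
            raw source buffer before the parser sees it."""
            spec = os.environ.get("PYTHONSOURCEFILTER", "")
            if not spec:
                return source_bytes      # null hook: "true" Python
            modname, funcname = spec.split(":")
            hook = getattr(importlib.import_module(modname), funcname)
            return hook(source_bytes)    # eg, a codec converting to UTF-8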

3.  The I/O hooks would be analogous, although you run into the usual
    problem that many I/O channels obey much less stringent
    consistency conditions than files do, and in general need not be
    rewindable.  A similar hook would presumably be desirable for
    primitive functions that "eval" strings.

4.  It probably won't be possible to plug existing codecs directly
    into the hook without specifying its interface too precisely.
    Therefore Python
    should provide a library of codec wrappers for hanging on the
    hook.
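
    For instance, a wrapper turning any codec the codecs module knows
    about into a hook function might look something like this (a
    sketch, not a spec; the "bytes in, UTF-8 bytes out" signature is
    my assumption from 2):

        import codecs

        def wrap_codec(encoding):
            """Return a hook function that decodes the source buffer
            from `encoding` and re-encodes it as UTF-8."""
            decode = codecs.getdecoder(encoding)
            def hook(source_bytes):
                text, _consumed = decode(source_bytes)
                return text.encode("utf-8")
            return hook

        # Eg, hang a Shift JIS wrapper on the preprocessing hook.
        sjis_hook = wrap_codec("shift_jis")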

5.  Users who wish to use non-UTF-8 encodings are strongly advised to
    use the "coding-cookie-in-comment at top of file" convention.  This
    convention is already supported by at least GNU Emacs and XEmacs
    (modulo XEmacs's "first line only bug") and should be easily
    implemented in other editors, including IDLE.  To encourage this,
    the library mentioned in 4 should provide an "autorecognition"
    codec (sketched at the end of this point) with at least these
    features: (1) it recognizes and acts on coding cookies, with good,
    verbose error messages if "corruption" is detected; (2) it
    recognizes and acts on the UTF-8 BOM, with "good" error messages;
    and (3) otherwise it defaults to UTF-8, again with "good" error
    messages.

    This would allow the "naked" interpreter to just give a terse
    "that ain't UTF-8" message.  The "naked" interpreter might want to
    error on a coding cookie.  I think a coding cookie of "utf-8"
    should probably be considered an error, as it indicates that the
    user doesn't know the language spec.<wink>  It might be desirable
    to extend feature (2) to other Unicode BOMs.

    Experience with Emacs Mule suggests that "smart" autorecognition
    (eg of ISO 2022 versions) is not something that Python should
    support as a standard feature, although the preprocessor hook
    would allow optional modules for this purpose to be added easily.
    Another "smart" hook function might make assumptions based on
    POSIX locale, etc.
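
    The sketch promised above: an autorecognition codec with features
    (1)-(3) might reduce to something like the following (names are
    hypothetical; the cookie regex follows the Emacs-style
    "-*- coding: ... -*-" convention, checked on the first two lines
    only):

        import codecs
        import re

        COOKIE = re.compile(br"coding[:=]\s*([-\w.]+)")

        def guess_source_encoding(source_bytes):
            """Map features (1)-(3) of point 5 onto a byte buffer."""
            # (2) recognize and act on the UTF-8 BOM.
            if source_bytes.startswith(codecs.BOM_UTF8):
                return "utf-8"
            # (1) recognize and act on a coding cookie in the first
            # two lines (the first line is often used for #!).
            for line in source_bytes.splitlines()[:2]:
                match = COOKIE.search(line)
                if match:
                    encoding = match.group(1).decode("ascii")
                    codecs.lookup(encoding)   # verbose error if bogus
                    # (a cookie saying "utf-8" could be flagged here,
                    # per the <wink> above)
                    return encoding
            # (3) otherwise, default to UTF-8.
            return "utf-8"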

6.  Some provision will probably need to be made for strings.

    Ordinary strings might need to be converted to Unicode or not,
    depending on how non-UTF-8 I/O channels are supported.  So the
    "codec wrappers" mentioned in 2, 3, 4, and 5 would probably need
    to understand Python string syntax, and it might be useful to have
    a "newt string" type.  A "newt string" would _always_ be protected
    from conversion to Unicode (and would have a minimal API to force
    programmers to not use them 'til "it got bettah").

    Unicode strings would be exactly that, and legacy strings would
    have semantics depending on the stage of the transition to
    Unicode-only, and possibly the user's environment.
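
    As a strawman, a "newt string" could start out as little more
    than the following (entirely hypothetical; the point is only the
    minimal, conversion-proof API):

        class NewtString:
            """A byte string shielded from conversion to Unicode,
            with a deliberately minimal API."""
            def __init__(self, raw):
                self._raw = bytes(raw)
            def raw(self):
                return self._raw     # the only way out
            def __str__(self):
                raise TypeError("newt strings do not convert to Unicode")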


Footnotes: 
[1]  Well, you could try to make that stick.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.