PEP 263 comments

Stephen J. Turnbull stephen at xemacs.org
Fri Mar 1 06:34:59 EST 2002


>>>>> "Martin" == Martin v Loewis <martin at v.loewis.de> writes:

    Martin> "Stephen J. Turnbull" <stephen at xemacs.org> writes:
    >> IMO, the Python source code parser should never see any text
    >> data[1] that is not UTF-8 encoded.  If you want to submit
    >> Python programs to the parser that are not UTF-8 encoded, then
    >> it is your responsibility as the programmer to make sure they
    >> get translated into UTF-8 (e.g., by the preprocessing hook)
    >> before the interpreter proper ever sees them.

    Martin> That would cause surprises to users. They have a source
    Martin> program that says

    Martin> # -*- coding: koi8-r -*-
    Martin> print "some cyrillic text"

    Martin> This currently works fine on their system; the text comes
    Martin> out on the terminal just right. Now, Python would convert
    Martin> this text silently to UTF-8 behind their backs, and the
    Martin> terminal would show just garbage.

No, it shows "Error: non-UTF-8 data detected in string."  Conversion
only takes place if a preprocessing hook function is defined, and the
same environment that provides an appropriate preprocessing hook will
also arrange to make sure that program I/O is done in KOI8-R, too.
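
To make that rule concrete, here is a minimal sketch in modern Python 3
(not the Python of this thread).  The names load_source, PreprocessHook
and koi8r_to_utf8 are hypothetical, invented for illustration; nothing
like them is specified in PEP 263 or in this message.  The sketch only
assumes that a hook, if present, turns raw bytes into UTF-8 bytes before
the parser would see them.

```python
from typing import Callable, Optional

# Hypothetical hook type: takes raw source bytes, returns UTF-8 bytes.
PreprocessHook = Callable[[bytes], bytes]


def load_source(raw: bytes, hook: Optional[PreprocessHook] = None) -> str:
    """Return source text; accept only UTF-8 unless a hook transcodes it."""
    if hook is not None:
        # Conversion happens only when the environment supplies a hook.
        raw = hook(raw)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # No silent conversion: non-UTF-8 input is a hard error.
        raise ValueError("non-UTF-8 data detected in source") from exc


def koi8r_to_utf8(raw: bytes) -> bytes:
    """Example hook an environment for KOI8-R users might install."""
    return raw.decode("koi8_r").encode("utf-8")


# A source file as a KOI8-R user would have it on disk.
koi8_source = 'print("привет")\n'.encode("koi8_r")

# With the hook installed, the text reaches the parser as UTF-8.
text = load_source(koi8_source, hook=koi8r_to_utf8)
```

Calling load_source(koi8_source) without the hook raises ValueError
instead of passing the bytes through or guessing at an encoding.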

But I take your point.  It will take time to develop such
environments.  In the interim, it will cause pain for users who
currently depend on undefined behavior.

You _can_ say "no" now, while things are undefined.  Or you can change
the language definition to promise support.  If you do that, you are
unlikely to be able to get rid of that support for decades, as legacy
software will depend on it.

    Martin> How to print it is then another issue; you'll have to
    Martin> figure out the encoding of the terminal - that is
    Martin> feasible in most cases.

Why open up that Pandora's box?  Push it out into user space.  Support
users as much as you want to with libraries, and give up when it gets
too hard (it will!).  My experience is that users will not thank you for
anything less than perfect support for all coding systems yesterday,
if the language definition promises any support at all.  If the
language definition says "UTF-8 or die", they will thank you for the
nice codecs you provide to ease the transition.
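
As a rough illustration of what "user space" support could look like,
the sketch below uses Python 3's standard io.TextIOWrapper to encode
program output as KOI8-R.  The helper name make_encoded_writer is
hypothetical, and the in-memory buffer stands in for a real terminal;
choosing the right encoding for a given terminal is exactly the part
left to the user's environment.

```python
import io


def make_encoded_writer(byte_stream, encoding):
    """Wrap a binary stream so that text written to it is encoded."""
    # newline="\n" disables newline translation, keeping bytes predictable.
    return io.TextIOWrapper(byte_stream, encoding=encoding, newline="\n")


# Stand-in for a terminal that expects KOI8-R bytes.
buf = io.BytesIO()
out = make_encoded_writer(buf, "koi8_r")

out.write("привет\n")
out.flush()

# The bytes that would have been sent to the terminal.
koi8_bytes = buf.getvalue()
```

The program itself deals only in text; the encoding decision lives in
one place that the environment, not the language definition, controls.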

    >> [1] I.e., Python language or character text.  It might be
    >> convenient to have an octet-string primitive data type, in
    >> which you could put EUC-encoded Japanese or Java byte codes.

    Martin> The traditional "string" type is, in fact, a byte string
    Martin> type. Many people use it still for character strings,

Maybe you don't need a third type.  I see it as a matter of a
transition strategy, to allow you to generate exactly the error I
suggest above.
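
For readers who want to see the distinction spelled out, here is a
small sketch using Python 3, where the octet-string and character-string
types are separate (bytes and str).  This is an illustration under that
assumption, not a description of the Python discussed in this thread;
the variable names are made up for the example.

```python
# Character text: a sequence of characters, no encoding attached.
text = "日本語"

# Octet string: EUC-JP encoded bytes, opaque until explicitly decoded.
euc_bytes = text.encode("euc_jp")

# Three characters, but six octets in EUC-JP.
char_count = len(text)
octet_count = len(euc_bytes)

# Treating the octets as UTF-8 text fails loudly instead of
# producing garbage, which is the error behavior argued for above.
try:
    euc_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the correct codec recovers the original text.
round_trip = euc_bytes.decode("euc_jp")
```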


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.


