Multibyte Character Support for Python

Stephen J. Turnbull stephen at xemacs.org
Thu May 9 10:08:01 EDT 2002


>>>>> "Alex" == Alex Martelli <aleax at aleax.it> writes:

    Alex> Stephen J. Turnbull wrote:

    >> Agreed.  Except that a decade from now Chinese might be the
    >> ONE.  Then we'll be glad we have hanzi identifiers, as Python
    >> sweeps the CJK world.<0.9 wink>

    Alex> Fine, WHEN that happens.  And IF, of course.  Meanwhile,
    Alex> Ruby will probably get there first, born in Japan and all,

"Made in Japan" is not exactly a password to rapid acceptance in this
neighborhood.  It's getting more so, but not there yet.

    Alex> right?  Hey, it IS so close to Python it almost hurts.  And
    Alex> AFAIK, it doesn't support what you so intensely want, which

It's not that intense.  I just see the tradeoff as being much more
balanced than you do.  CP4E is important, too.

    Alex> hasn't stopped it from huge Japan success, though it may in
    Alex> the future.

The insularity you mention is working for it here.  But its success is
deserved.

    >> ... is it really worth sacrificing the ability to introduce
    >> more non-programmers to programming to avoid "helping
    >> fragmentation" by 25% over what those who want localized
    >> identifiers already can do?

    Alex> Yes.  And it's NOT worth (IMHO, of course) helping the
    Alex> Japanese keep ever more insular and separated from the rest

The ones who are causing and aggravating the problems will _never_
write any code; in fact, more widespread computer literacy---even if
everything the people think they know is wrong!---would substantially
decrease the gerontocrats' power, IMO.

    Alex> Sure, it's bad enough if said code has an identifier
    Alex> "principi" -- you don't know what it means and must infer
    Alex> from context.  (The comments and docstrings if any are
    Alex> likely just as obscure).

And if the fact that the ONE language is Italian (I presume) forced me
to choose "principi" as an identifier, thus misdocumenting the variable
to those who _do_ understand Italian?

    Alex> confusion is still possible but much less likely than in a
    Alex> language WITH diacritics

Very true.

    Alex> or thousands of different glyphs

This will make it unreadable to non-CJK-capable programmers, true.
But the likelihood of typos and confusion is at least as low as in
English, perhaps lower.  It is also likely to permit similar levels of
expressiveness in less space (even with the 2:1 width ratio common for
ideographs vs. alphabetic characters).

    Alex> They may THINK they want to "speak their own language to the
    Alex> computer", but they _don't_, really: if they think so, it's
    Alex> because they still haven't grasped the key differences.

Agreed.  Unfortunately, my faculty won't permit the use of a LART,
which is the only technique I know of to get a reasonable share of
their attention to direct at key differences.  Smacking them with
English just puts them to sleep.

    Alex> Help them learn, rather than "helping" them hide their
    Alex> ignorance from themselves.

MHO (based on the limited, very introductory courses I've taught in
programming) is that helping them to learn what programming really is
means removing as many as possible of the incidental difficulties
involved in getting their first real (ie, a task they choose) program
working.

Disciplines of good identifier choice, etc, come later.  These have to
be enforced by The Management, anyway.  Simply saying "no kanji" isn't
enough, as you well know (and the no-kanji rule is easily enforced
mechanically, which is something you can't say for "choose meaningful
identifiers").

Note that as an economics professor, I do have some experience with
the issue of weaning students from their milk language.  There is
nearly zero published work of professional interest---except for
national economic policy---in any language but English.  Not even
French or Russian.  That doesn't stop there from being about 50
Japanese-language journals---but the students all know what's good for
them, and they don't _target_ the vernacular journals.

I can see all the reasons why that might not carry over to
programming.  But on the other hand, it shows that there _is_
a possibility that you can accomplish your goal with not very much extra
effort.  You just need to convince the leaders.  Even in a world where
one can choose identifiers written with ideographs.

In any case, Martin's point about bytecode compatibility once we
introduce Unicode identifiers is probably enough to make a real
multilingual Python impractical for the foreseeable future (maybe
Python 4?).  I plan to experiment with a UTF-8 Python anyway.  Keeping
your comments in mind, one thing I'll work on early is tools for
translating identifiers (presumably to English) and on metrics for
identifiers that are "too close" to one another.  Even if Python never
needs them, some language will.
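
To make the "too close" idea concrete, here is a minimal sketch of one
possible metric.  Everything in it is an assumption for illustration,
not a description of any existing tool: the function names (normalize,
closeness, too_close_pairs), the use of NFKC normalization plus case
folding, the difflib similarity ratio, and the 0.85 threshold are all
hypothetical choices.

```python
import difflib
import unicodedata
from itertools import combinations


def normalize(identifier):
    # Assumption: NFKC folds compatibility forms (e.g. full-width
    # letters) together, and casefold() ignores case differences.
    return unicodedata.normalize("NFKC", identifier).casefold()


def closeness(a, b):
    # Similarity ratio in [0, 1]; 1.0 means the two identifiers are
    # identical after normalization.
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b))
    return matcher.ratio()


def too_close_pairs(identifiers, threshold=0.85):
    # Report every pair of distinct identifiers whose closeness meets
    # the (hypothetical) threshold.
    unique = sorted(set(identifiers))
    return [
        (a, b)
        for a, b in combinations(unique, 2)
        if closeness(a, b) >= threshold
    ]


if __name__ == "__main__":
    for pair in too_close_pairs(["principi", "principii", "count"]):
        print(pair)
```

A character-level ratio like this only captures spelling similarity.
Visual confusability between glyphs, which matters more for
ideographs, would need a different measure.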

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 My nostalgia for Icon makes me forget about any of the bad things.  I don't
have much nostalgia for Perl, so its faults I remember.  Scott Gilbert c.l.py


