Multibyte Character Support for Python

Stephen J. Turnbull stephen at xemacs.org
Sat May 11 03:29:12 EDT 2002


>>>>> "Huaiyu" == Huaiyu Zhu <huaiyu at gauss.almadan.ibm.com> writes:

    Huaiyu> Martin v. Loewis <martin at v.loewis.de> wrote:

    >> For the Unicode type, nothing would change - Stephen did not
    >> propose to change the Unicode type.

    >> Instead, he proposed that non-ASCII identifiers are represented
    >> using UTF-8 encoded byte strings (instead of being represented
    >> as Unicode objects); in that case, and for those identifiers,
    >> len() would return the number of UTF-8 bytes.

    Huaiyu> But would that be different from the number of characters?

No, for all backward-compatible code (i.e., code whose identifiers are
all ASCII).  Yes, for code actually using the proposed extension to
non-ASCII identifiers.
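
To see the difference concretely (a sketch; the name and its value are
made up):

    # A non-ASCII name as the proposal would store it internally.
    name = u'\u03b1\u03b2'.encode('utf-8')   # GREEK ALPHA + BETA
    print len(name)                    # 4 (UTF-8 octets)
    print len(name.decode('utf-8'))    # 2 (characters)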

    Huaiyu> My confusion comes from his assertion that Python itself
    Huaiyu> does not need to care whether it's a raw string or unicode.

My assertion is that we can choose either, and Python itself will work
fine, not that Python itself doesn't need to care.  Furthermore, if we
choose UTF-8 as the internal encoding for non-ASCII identifiers,
Python itself doesn't need to be changed at all, except for the code
that tests whether an identifier is legal.
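
For illustration only, that test might loosen up roughly like this (a
sketch in Python, not the actual patch; the real change would live in
the C tokenizer):

    def legal_identifier_char(ch, first):
        # ASCII rules as today, plus: any octet with the high bit
        # set is accepted as part of a UTF-8 encoded character.
        if ch.isalpha() or ch == '_' or ord(ch) >= 0x80:
            return 1
        return ch.isdigit() and not first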

What would care is introspective code.  Examples:

(1) Code that constructs a distribution of lengths of identifiers
    known to the interpreter would be biased toward long identifiers,
    since in UTF-8 #octets >= #characters.

(2) Code that uses identifiers in eval constructs would need to do
    some horrible thing like

    exec "print x + y".decode('iso-8859-1').encode('utf-8')

Note that in this all-ASCII example the recoding is redundant, but it
would work.
Also the PEP 263 mechanism could be extended to give the program an
"execution locale" and automatically do that conversion.  (Horrible,
but in the spirit of that PEP.)
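
Spelled out, such an "execution locale" shim might look like this (a
hypothetical helper; 'iso-8859-1' stands in for whatever encoding the
source declares):

    def exec_in_locale(code, declared_encoding='iso-8859-1'):
        # Recode from the source's declared encoding to the
        # interpreter's internal UTF-8 before executing.
        exec code.decode(declared_encoding).encode('utf-8')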

    Huaiyu> Is there any need for the interpreter to split an
    Huaiyu> identifier into a sequence of characters?  If the answer is
    Huaiyu> no, then I guess my question is moot.

There's no need that I know of for the interpreter to do so.  However
(one of) Martin's points is that there are (very useful!) tools that
do, and these would either be "broken by the extension" or "merely
unreliable" for code that uses the non-ASCII identifier extension,
depending on your point of view.

Obviously I prefer the latter interpretation.  I suggest that projects
that require reliable operation of introspective tools hire someone
like the martellibot to do coding standard enforcement<wink>.  But the
"broken" interpretation is also reasonable, and I assume that is the
one that MvL holds.

    Huaiyu> My question was about what would be the case under the
    Huaiyu> proposals.  But I guess I'm way out of my domain here.

The basic fact is that Unicode support for strings is already decided.
I disagree with some implementation decisions (e.g., the idea of
prepending ZERO-WIDTH NO-BREAK SPACE (the byte order mark) to strings
intended to be exported in UTF-16 encoding is just insane IMO; code
must be written like

print list_of_strings[0]          # the first string keeps its BOM
for s in list_of_strings[1:]:
    print s[2:]                   # strip the two-octet BOM from the rest

Yuck!)  But that's just something I can easily work around by defining
a codec to my liking---in fact the ease with which I can do this shows
the overall strength of the scheme adopted.
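
For instance, a BOM-free UTF-16 codec falls out of the byte-order-specific
built-ins, which never emit the signature (a sketch; the codec name
'utf-16-nobom' is my invention):

    import codecs

    def _search(name):
        # 'utf-16-be' and 'utf-16-le' never prepend the ZERO-WIDTH
        # NO-BREAK SPACE signature, so just alias one of them.
        if name == 'utf-16-nobom':
            return codecs.lookup('utf-16-be')
        return None

    codecs.register(_search)

    u'spam'.encode('utf-16-nobom')   # '\x00s\x00p\x00a\x00m', no BOM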

It is the interface of this well-ordered pythonic environment to the
disorderly world of natural language that is under discussion.  PEP
263 provides standard support for people who wish to embed localized
(i.e., non-ASCII) literal strings (both ordinary and Unicode) and
comments in their source files.  Note that source code comes from
"outside of" Python; Python has no control over, nor even a way to
know, the encoding used.

Currently use of localized literal ordinary strings is possible, and
some projects depend on it, because of the specific Python
implementation.  PEP 263 standardizes the situation in a way backward
compatible with these (very natural) "abuses" of the implementation, and
mandates its extension to literal Unicode strings.
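
Concretely, a PEP 263 source file declares its encoding at the top (the
encoding and literals here are just examples):

    # -*- coding: iso-8859-1 -*-
    greeting = "¡Hola!"        # localized ordinary string literal
    ugreeting = u"¡Hola!"      # its Unicode counterpart, decoded
                               # per the declared coding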

My proposal goes further and allows localized identifiers.  AFAIK
Erno's use is just a curio; Alex's arguments for using English in
identifiers where at all possible, and certainly ASCII, are strong and
natural.  Even Japanese programmers rarely break this rule.  So AFAIK
there is no body of code out there to be backward compatible with.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 My nostalgia for Icon makes me forget about any of the bad things.  I don't
have much nostalgia for Perl, so its faults I remember.  Scott Gilbert c.l.py


