Multibyte Character Support for Python

Tue May 14 09:31:12 EDT 2002

Kragen Sitaker wrote:

> I agree that programming language keywords should not be localized;
> the notations for iteration, conditionals, math, abstraction,
> application, and so forth, should not vary by language.  It is
> perfectly acceptable for a person who does not speak English to learn
> "if", "for", "except", and so forth, in order to speak Python; the
> vocabulary is quite small.  It is no different from American musicians
> having to learn "allegro", "D.C. al fine", and "tremolo" --- it simply
> doesn't add significantly to the difficulty of the notation.

I disagree.  I wouldn't object if a language used "si" or "weil" instead 
of "if".  But I sure as heck wouldn't want to use a Chinese character.  No 
matter how good a programming language is, if it requires the use of 
Chinese characters I'm not touching it.  I wouldn't expect a monolingual 
Chinese speaker to feel any better about Python.  Remember, the subject is 
"multibyte character support", not "alternative European code page 
support".

> But variable and function names belong to the programmer and the
> program's audience, not the notation, and should be written in the
> language that affords these people the most expressive power.

Yes, but you can write any language using the Roman alphabet.  If you can 
learn to use that alphabet for the keywords, you can translate variable 
names as well.  Native characters are then only a matter of convenience, 
mainly for speakers of European languages that use accented characters.

Is it such a big problem to lose the accents?  You still have to deal with 
a standard library built around English.  And there are all kinds of 
problems that arise when you use arbitrary character sets.  For example 
(hoping these come out right), à and á can look similar from a distance, 
as can "Latin Small Letter A With Macron".  Would you feel confident 
distinguishing ã and ä on a low-resolution monitor?  What happens if you 
receive code that uses a character set you don't have a font for?  If you 
look through some Unicode tables you'll see characters that look 
identical, and in some cases are defined to be identical.  Does the 
interpreter have to keep a lookup table of equivalences?  How does it know 
what constitutes a "letter" in the first place?
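
To make the last two questions concrete, here is a rough sketch using the 
standard unicodedata module.  It assumes a Python 3 interpreter where 
strings are Unicode, and the helpers same_identifier and is_letter are 
names made up for this illustration, not anything the interpreter is known 
to do internally.

```python
import unicodedata

def same_identifier(a, b):
    """Hypothetical helper: two names match if equal after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

def is_letter(ch):
    """Hypothetical helper: a letter is any character whose general category starts with 'L'."""
    return unicodedata.category(ch).startswith("L")

composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                 # raw code point comparison
print(same_identifier(composed, decomposed))  # comparison after normalization
print(same_identifier("\u212b", "\u00c5"))    # ANGSTROM SIGN versus A WITH RING ABOVE
print(is_letter("\u0101"), is_letter("1"))    # a-macron versus a digit
```

So the "lookup table" can be the Unicode character database rather than 
something the parser maintains by hand, though whether NFKC is the right 
notion of equivalence for identifiers is itself a design decision.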

I don't know if it's for English speakers to comment on, but I feel uneasy 
about such a change.  If the parser could recognise arbitrary characters, 
the regular expressions knew what a letter was independent of locale, and 
Unicode strings could be reliably compared, then at least the 
implementation would be easy.  But I can see people shooting themselves in 
the foot as easily as they do with pointer arithmetic.  Still, write a PEP 
if you know exactly what you want.  I could sleep much easier knowing such 
a proposal had been definitively rejected.
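
For what "regular expressions that know what a letter is, independent of 
locale" might look like, here is a small sketch.  It assumes Python 3, 
where str patterns in the re module are Unicode-aware by default; the 
pattern name LETTERS and the function letter_runs are invented for this 
example.

```python
import re

# \w covers Unicode letters, digits and underscore; excluding \W, \d and _
# leaves just the letters, with no dependence on the current locale.
LETTERS = re.compile(r"[^\W\d_]+")

def letter_runs(text):
    """Hypothetical helper: return the maximal runs of letters found in text."""
    return LETTERS.findall(text)

print(letter_runs("día_1 = größe + 变量"))
```

This says nothing about whether such names are a good idea in source code, 
only that the "what is a letter" part need not depend on locale settings.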

                          Graham