Multibyte Character Support for Python

John Roth johnroth at ameritech.net
Sat May 11 08:09:44 EDT 2002


"Martin v. Löwis" <loewis at informatik.hu-berlin.de> wrote in message
news:j44rhf40as.fsf at informatik.hu-berlin.de...
> "John Roth" <johnroth at ameritech.net> writes:
>
> > The trouble is that while almost all of the languages used in the
> > Americas, Australia and Western Europe are based on
> > the Latin alphabet, that isn't true in the rest of the world, and
> > even then, it gets uncomfortable if your particular language's
> > diacritical marks aren't supported. You can't do really good,
> > descriptive names.
>
> I personally can live without the diacritical marks in program source
> code, except when it comes to spelling my name - and I usually put
> this into strings and comments only.
>
> I'm fully aware that many people in this world write their languages
> without latin letters. I still doubt that this is an obstacle when
> writing software.
>
> > 1. In Python 3.0, the input character set is unicode - either UTF-16
> > or UTF-8 (I'm not prepared to make a solid argument one way or the
> > other at this time.)
>
> Actually, PEP 263 gives a much wider choice; consider this aspect
> solved.

I just read that PEP. As far as I'm concerned, it's not solved; the
solution would be much worse than the disease. Python is noted
for simplicity and for having one way to do most things. PEP 263
(syntax issues aside) simply obfuscates the issue for quite minor returns.
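
For reference, PEP 263 lets a source file declare its encoding in a
comment on the first or second line. A minimal sketch of how such a
declaration is read, assuming a modern Python 3 and its standard
tokenize module (the sample source bytes are made up for illustration):

```python
import io
import tokenize

# Hypothetical source file: Latin-1 bytes with a PEP 263 declaration.
source = b"# -*- coding: latin-1 -*-\nauthor = 'L\xf6wis'\n"

# detect_encoding() reads at most two lines looking for the declaration
# and returns the normalized encoding name plus the lines it consumed.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# With the encoding known, the raw bytes can be decoded to text.
text = source.decode(encoding)
print(encoding)
```

The point of contention above is that every file may pick a different
encoding this way, rather than all source being Unicode.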

> > 2. All identifiers MUST be expressed in the character set of
> > a single language (treating the various latin derived languages
> > as one for simplicity.) That doesn't mean that only one language
> > can be used for a module, only that a particular identifier must make
> > lexical sense in a specific language.
>
> That sounds terrible. Are you sure you can implement this? For
> example, what about the Cyrillic-based languages? Are you also
> treating them as one for simplicity? Can you produce a complete list
> of languages, and for each one, a complete list of characters?

I believe that the Unicode Consortium has already considered this.
After all, they didn't just add character encodings at random; they've
got specific support for many, many languages. I don't need to
repeat their work, and much more importantly, neither does the
core Python language team.
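
A rough sketch of what leaning on the Unicode data for point 2 might
look like. Python's unicodedata module does not expose the Unicode
script property directly, so this takes the first word of each
character's name ("LATIN", "CYRILLIC", ...) as a stand-in. That
shortcut, and the function names, are assumptions for illustration,
not a real rule from the Unicode Consortium:

```python
import unicodedata

def scripts_in(identifier):
    """Return the set of (approximate) scripts used by an identifier."""
    scripts = set()
    for ch in identifier:
        # Underscores and digits are shared by all scripts; skip them.
        if ch == "_" or ch.isdigit():
            continue
        name = unicodedata.name(ch, "")
        # Assumption: the first word of the character name names the script.
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def is_single_script(identifier):
    """True if all letters of the identifier come from one script."""
    return len(scripts_in(identifier)) <= 1

# Latin letters with diacritics still count as one script.
print(is_single_script("gr\u00f6\u00dfe"))
# A Cyrillic 'a' (U+0430) hidden among Latin letters does not.
print(is_single_script("n\u0430me"))
```

Under this approximation all Latin-derived languages collapse into one
group, and so do all Cyrillic-based ones, which is the simplification
point 2 asks for.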

> > 3. There must be a complete set of syntax words in each
> > supported language. That is, words such as 'and', 'or', 'if', 'else'.
> > All such syntax words in a particular module must come from the
> > same language.
>
> That is even more terrible. So far, nobody has proposed to translate
> Python keywords. How are you going to implement that: i.e. can you
> produce a list of keywords for each language? How would I spell 'def'
> in German?

AFAIC, the spelling is up to the people who want to code in a particular
language. I haven't considered implementation, but it seems like it should be
incredibly simple, given that point 4 means that syntax words are
easily distinguishable by the lexer. Think in terms of a dictionary,
although performance considerations probably mean that something
faster would be necessary.
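
A minimal sketch of the dictionary idea, assuming a modern Python 3
and its standard tokenize module. The German spellings in the table
are made up for illustration, and translate_keywords is a hypothetical
helper, not anything Python provides:

```python
import io
import tokenize

# Hypothetical German spellings mapped to the real Python keywords.
KEYWORDS_DE = {"wenn": "if", "sonst": "else", "und": "and", "oder": "or"}

def translate_keywords(source, table):
    """Rewrite translated syntax words into standard Python keywords."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        # Only NAME tokens are looked up, so words inside string
        # literals and comments are left alone.
        if tok.type == tokenize.NAME:
            text = table.get(text, text)
        tokens.append((tok.type, text))
    # With (type, string) pairs, untokenize() rebuilds valid source,
    # although the spacing may differ from the input.
    return tokenize.untokenize(tokens)

german_source = (
    "wenn x > 0 und x < 10:\n"
    "    y = 'klein'\n"
    "sonst:\n"
    "    y = 'gross'\n"
)

namespace = {"x": 5}
exec(translate_keywords(german_source, KEYWORDS_DE), namespace)
print(namespace["y"])
```

This runs the lookup once per name token as a preprocessing pass; a
real lexer would presumably fold the table into its keyword check.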

John Roth