[Python-Dev] Allowing non-ASCII identifiers

François Pinard pinard at iro.umontreal.ca
Mon Feb 9 10:51:17 EST 2004


[Martin von Löwis]
> I'd like to work on adding support for non-ASCII characters
> in identifiers[...]

Such support would surely be extremely welcome to me, and to most of
my co-workers.  There are likely many teams around this planet that would
appreciate it as well.  Tell me if you think I may help somehow, despite
my modest means (I'm overloaded with duties already, but this is the
story for most of us).

> 1. At run-time, identifiers are represented as Unicode objects unless
> they are pure ASCII.  IOW, they are converted from the source encoding
> to Unicode objects in the process of parsing.

This is already the case, isn't it?
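
For illustration only, here is a small sketch of point 1 as it can be
observed with the standard `tokenize' module.  It assumes a current
Python 3 interpreter rather than the Python of this thread, and the
sample source text is made up for the example:

```python
import io
import tokenize

# Bytes as a Latin-1 encoded source file would contain them (illustrative).
source = "# -*- coding: Latin-1 -*-\nélève = 3\n".encode("latin-1")

# tokenize.tokenize() reads bytes, honours the coding line, and yields
# tokens whose text is already decoded into str (Unicode) objects.
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# The first token reports the encoding that was detected.
encoding = tokens[0].string

# Keep only the identifier tokens.
names = [tok.string for tok in tokens if tok.type == tokenize.NAME]

print(encoding, names)
```

The identifier reaches the rest of the program as Unicode text, whatever
the encoding of the file it came from.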

> 2. As a consequence of 1), all places where identifiers appear need to
> support Unicode objects (e.g. __dict__, __getattr__, etc)

I do not know the internals well, yet I suspect one more thing to
consider is whether Unicode strings looking like non-ASCII identifiers
should be interned or not, the same as is currently done for ASCII.
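
A minimal sketch of what interning means for such a string, assuming a
current Python 3 interpreter where `sys.intern' accepts any str; the
helper name `intern_pair' is hypothetical and only serves the example:

```python
import sys

def intern_pair(name):
    # Build two equal strings at run time, so that the result does not
    # depend on the compiler sharing a literal between them.
    first = "".join(list(name))
    second = "".join(list(name))
    # sys.intern() returns the single shared copy kept in the intern table.
    return sys.intern(first), sys.intern(second)

a, b = intern_pair("élève")
print(a is b)  # the two interned results are one and the same object
```

Once interned, names can be compared by identity, which is the usual
reason for interning identifiers in the first place.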

> 3. Legal non-ASCII identifiers are what legal non-ASCII identifiers
> are in Java, except that Python may use a different version of the
> Unicode character database.  Python would share the property that
> future versions allow more characters in identifiers than older
> versions.

>    If you are too lazy to look up the Java definition, here is a
>    rough overview:  An identifier is "JavaLetter JavaLetterOrDigit*"

>    JavaLetter is a character of the classes Lu, Ll, Lt, Lm, or Lo,
>    or a currency symbol (for Python: excluding $), or a connecting
>    punctuation character (which is unfortunately underspecified - will
>    research the implementation).

>    JavaLetterOrDigit is a JavaLetter, or a digit, a numeric letter,
>    a combining mark, a non-spacing mark, or an ignorable control
>    character.

Then, maybe we should be a tad conservative whenever there is some
doubt, rather than sticking too closely to Java.  It is easier to
open things up a bit more later than to close what was opened.  For
example, all currency symbols might be verboten to start with.  Or maybe
not.  Connecting punctuation characters might be limited to the
underscore to start with, and might also be added to JavaLetterOrDigit.
A sure thing is that underscores should be allowed embedded within
non-ASCII identifiers.  Is the no-break space a "connecting
punctuation"? :-)
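
To make the conservative variant concrete, here is a rough sketch using
the standard `unicodedata' module.  It assumes a current Python 3
interpreter, follows the quoted overview rather than the full Java rules,
and leaves out currency symbols, other connecting punctuation, and
ignorable control characters.  The function names are hypothetical:

```python
import unicodedata

# Letter categories named in the quoted overview of JavaLetter.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

# Nd = decimal digit, Nl = letter number,
# Mc = spacing combining mark, Mn = non-spacing mark.
CONTINUE_CATEGORIES = LETTER_CATEGORIES | {"Nd", "Nl", "Mc", "Mn"}

def is_start(ch):
    # The underscore is the only connecting punctuation accepted here.
    return ch == "_" or unicodedata.category(ch) in LETTER_CATEGORIES

def is_continue(ch):
    return ch == "_" or unicodedata.category(ch) in CONTINUE_CATEGORIES

def is_conservative_identifier(name):
    # Non-empty, a valid first character, then valid following characters.
    return (bool(name)
            and is_start(name[0])
            and all(is_continue(ch) for ch in name[1:]))

for candidate in ["élève", "élève_2", "2élève", "prix€", "a\u00a0b"]:
    print(repr(candidate), is_conservative_identifier(candidate))
```

Under this rule the first two candidates are accepted and the last three
are rejected.  The no-break space, for what it is worth, is in category
Zs (space separator), so it is not connecting punctuation.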


Just for amusement, I noticed that if file `francais.py' contains:

----------------------------------------------------------------------->
# -*- coding: Latin-1 -*-
élève = 3
print élève
-----------------------------------------------------------------------<

and file `francais' contains:

----------------------------------------------------------------------->
import locale
locale.setlocale(locale.LC_ALL, '')
import francais
-----------------------------------------------------------------------<

then the command `python francais', in my environment where `LANG' is
set to `fr_CA.ISO-8859-1', yields:

---------------------------------------------------------------------->
3
----------------------------------------------------------------------<

So, the Python compiler is sensitive to the active locale.  Someone
pointed out, a good while ago, that Latin-1 characters were accepted
interactively because `readline' was setting the locale, but it seems
that setting the locale ourselves allows for batch import as well.

This is kind of a happy bug!  May we count on it being supported in the
interim? :-) :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard


