Allowing non-ASCII identifiers

Thu Feb 12 02:04:11 EST 2004

Dietrich Epp wrote:
> You could require that all identifiers be the canonically decomposed 
> Unicode representations encoded into UTF-8.  

That would be unpythonic: non-ASCII identifiers should be represented
as Unicode objects, not as UTF-8 byte strings.
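
To illustrate the distinction, here is a minimal sketch, assuming a
Python version in which str is the Unicode string type. The sample name
is made up. Normalization changes the sequence of code points, while
UTF-8 encoding produces a separate byte string:

```python
import unicodedata

# Hypothetical identifier containing a precomposed e-acute (U+00E9).
name = "caf\u00e9"

# Canonical decomposition (NFD) splits U+00E9 into 'e' + U+0301.
# The result is still a Unicode string, not a byte string.
decomposed = unicodedata.normalize("NFD", name)

# Encoding is a separate step that yields bytes.
encoded = decomposed.encode("utf-8")

print(len(name), len(decomposed), len(encoded))
print(type(decomposed).__name__, type(encoded).__name__)
```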

> I personally think that Unicode identifiers would be catastrophic.  With 
> Unicode on the web, if you can't represent some characters, you can't 
> read the web page.  With programming, it could mean that you are unable 
> to use a particular module, altering the functionality for people who 
> can't enter certain codes. 

It is true that some people would have problems invoking certain
functions. Why would that be a catastrophe? Authors of Python software
should choose whether they prefer readability of the source code
or accessibility to everyone. Depending on the situation, one choice
or the other may be appropriate. Python should not police that decision
for the developer.

> There is also the issue of which characters 
> to allow, because some characters look like numbers.

Yes. I would go with a list similar to the Java one, except with a
few obvious restrictions (e.g. disallow currency symbols: Python
does not allow the DOLLAR SIGN in identifiers, whereas Java does).

> Is unicode 'IV' a number or an identifier?

It is certainly *not* a number. I propose to change the syntax of
identifiers, not of numbers. Whether this specific character Ⅳ is
an identifier or should give a syntax error is a choice one needs
to make, certainly. What would be your choice?

> What about a circled 4?  What about unicode
> line breaks and paragraph breaks?  What about opening and closing quote 
> marks?  What about right-to-left characters?  What about ligatures?  
> Non-breaking spaces?  Function application?

The Unicode consortium gives guidance on all these questions. As I said,
I would closely follow the Java principles, which were derived from
the Unicode consortium guidance. Here is my proposal:

     Legal non-ASCII identifiers are what legal non-ASCII
     identifiers are in Java, except that Python may use
     a different version of the Unicode character database.
     Python would share the property that future versions
     allow more characters in identifiers than older versions.

     If you are too lazy to look up the Java definition,
     here is a rough overview:
     An identifier is "JavaLetter JavaLetterOrDigit*"

     JavaLetter is a character of the classes Lu, Ll,
     Lt, Lm, or Lo, or a currency symbol (for Python:
     excluding $), or a connecting punctuation character
     (which is unfortunately underspecified - will
      research the implementation).

     JavaLetterOrDigit is a JavaLetter, or a digit,
     a numeric letter, a combining mark, a non-spacing
     mark, or an ignorable control character.

I believe this specification allows you to answer your questions
yourself.
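
The rough overview above can be sketched with the standard unicodedata
module. This is only an illustration under stated assumptions, not an
implementation of the proposal: it follows the overview as written
(which may differ from what Java actually implements), it treats
"ignorable control character" as the general category Cf, and the
names START_CATEGORIES, PART_CATEGORIES, and is_identifier are made up
for this example.

```python
import unicodedata

# First character ("JavaLetter" in the overview): letters, currency
# symbols, and connecting punctuation (which includes the underscore).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Sc", "Pc"}

# Later characters ("JavaLetterOrDigit"): additionally digits, numeric
# letters, combining marks, non-spacing marks, and format characters.
# Assumption: Cf stands in for "ignorable control character".
PART_CATEGORIES = START_CATEGORIES | {"Nd", "Nl", "Mc", "Mn", "Cf"}


def is_start(ch):
    """True if ch may begin an identifier ($ is excluded for Python)."""
    return ch != "$" and unicodedata.category(ch) in START_CATEGORIES


def is_part(ch):
    """True if ch may appear after the first character."""
    return ch != "$" and unicodedata.category(ch) in PART_CATEGORIES


def is_identifier(name):
    """Check name against the "JavaLetter JavaLetterOrDigit*" pattern."""
    return (
        bool(name)
        and is_start(name[0])
        and all(is_part(ch) for ch in name[1:])
    )


# Some of the characters raised in the questions above.
for label, text in [
    ("roman numeral four", "\u2163"),
    ("letter + roman numeral four", "x\u2163"),
    ("circled 4", "\u2463"),
    ("line separator", "a\u2028b"),
    ("opening quote mark", "\u201cx"),
    ("non-breaking space", "a\u00a0b"),
]:
    print(label, is_identifier(text))
```

Under these assumptions the Roman numeral Ⅳ (category Nl) is rejected
at the start of an identifier but accepted after a letter, while the
circled 4 (category No), line separators, quote marks, and
non-breaking spaces are rejected everywhere.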

> I think the assumption some people have is that Unicode will only ever 
> be used for things that are like the roman alphabet: adding diacritical 
> marks, etc.  It sounds like the most worthless extension ever, and the 
> only language I think of when I think of special characters is Intercal.  

That is certainly not my assumption. Instead, I expect that this
extension will primarily be used by developers whose native language
is Russian, Japanese, Chinese, Korean, or Arabic. At least, I've heard
developers from these cultures ask for the specific feature in the
past (I've also heard French and German people ask for the feature,
but that fits with your expectation).

Regards,
Martin