PEP 3131: Supporting Non-ASCII Identifiers

Thu May 17 13:41:03 EDT 2007

On Sun, 13 May 2007 17:44:39 +0200, Martin v. Löwis wrote:

> The syntax of identifiers in Python will be based on the Unicode
> standard annex UAX-31 [1]_, with elaboration and changes as defined
> below.
>
> Within the ASCII range (U+0001..U+007F), the valid characters for
> identifiers are the same as in Python 2.5.  This specification only
> introduces additional characters from outside the ASCII range.  For
> other characters, the classification uses the version of the Unicode
> Character Database as included in the ``unicodedata`` module.
>
> The identifier syntax is ``<ID_Start> <ID_Continue>*``.
>
> ``ID_Start`` is defined as all characters having one of the general
> categories uppercase letters (Lu), lowercase letters (Ll), titlecase
> letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
> (Nl), plus the underscore (XXX what are "stability extensions" listed in
> UAX 31).
>
> ``ID_Continue`` is defined as all characters in ``ID_Start``, plus
> nonspacing marks (Mn), spacing combining marks (Mc), decimal number
> (Nd), and connector punctuations (Pc).
>
>
> [...]
>
> .. [1] http://www.unicode.org/reports/tr31/

First, to Martin: Thanks for writing this PEP.

I have been reading both sides of this debate and find both
reasonable and understandable in the main, but I have several
questions that do not seem to have been raised in this thread so far.

Currently, in Python 2.5, an identifier must start with an upper- or
lowercase letter or an underscore ('_'); the following characters of
the identifier may additionally be numerical digits ("0"..."9").

The current rule seems easy to remember, even if many find it
restrictive.
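
To make that rule concrete, here is a minimal sketch of it as a regular
expression. The names ``ASCII_IDENTIFIER`` and ``is_py25_identifier``
are my own, chosen for illustration, and the sketch ignores reserved
words such as ``if`` or ``class``; it is not how the interpreter itself
tokenizes names.

```python
import re

# ASCII-only rule as described above: a letter or underscore first,
# then any number of letters, digits, or underscores.
# (Hypothetical helper names; reserved words are not checked.)
ASCII_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def is_py25_identifier(name):
    """Return True if name fits the ASCII-only identifier rule."""
    return ASCII_IDENTIFIER.match(name) is not None


if __name__ == "__main__":
    for candidate in ["spam", "_spam1", "1spam", "sp-am"]:
        print(candidate, is_py25_identifier(candidate))
```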

Contrariwise, the referenced document "UAX-31" is a bit obscure to me
(which is not eased by the fact that various browsers render non-ASCII
characters differently or not at all depending on the setup and font
sets available). Further, a cursory perusal of the unicodedata module
merely seems to refer me back to the Unicode docs.

I note that UAX-31 seems to allow "ideographs" as ``ID_Start``, for
example. From my relative state of ignorance, several questions come
to mind:

1) Will this allow me to use, say, a "right-arrow" glyph (if I can
find one) to start my identifier? 

2) Could an ``ID_Continue`` be used as an ``ID_Start`` if using a RTL
(reversed or "mirrored") identifier? (Probably not, but I don't know.)

3) Is there, or will there be, a definitive and exhaustive listing
(with bitmap representations of the glyphs to avoid the font issues)
of the glyphs that PEP 3131 would allow in identifiers? (Does this
question even make sense?)
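
One way to probe these questions is to test characters against the
general categories quoted from the draft above. The following is only a
sketch under assumptions: the helper names (``is_id_start``,
``is_id_continue``, ``is_draft_identifier``) are hypothetical, it
ignores the "stability extensions" the draft still marks as XXX, and it
applies no normalization. It is written for a current Python 3
interpreter and relies only on ``unicodedata.category``.

```python
import unicodedata

# General categories taken from the quoted draft text.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch):
    """Draft ID_Start: the listed letter categories plus underscore."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch):
    """Draft ID_Continue: ID_Start plus marks, digits, connectors."""
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def is_draft_identifier(name):
    """Check <ID_Start> <ID_Continue>* in stored character order."""
    if not name:
        return False
    return is_id_start(name[0]) and all(is_id_continue(c) for c in name[1:])


if __name__ == "__main__":
    # U+2192 RIGHTWARDS ARROW, U+4E2D (a CJK ideograph), and a digit.
    for ch in ["\u2192", "\u4e2d", "7", "_"]:
        print(repr(ch), unicodedata.category(ch), is_id_start(ch))
```

In this sketch the rightwards arrow (U+2192) falls in a symbol
category rather than a letter category, so it is rejected as a first
character, while an ideograph such as U+4E2D (category Lo) is accepted.
The check looks only at the order in which the characters are stored,
so how a right-to-left name happens to be displayed plays no part in
it.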

I have long programmed in RPL and have appreciated being able to use,
say, a "right arrow" symbol to start a name of a function (e.g., "->R"
or "->HMS" where the '->' is a single, right-arrow glyph).[1]

While it is not clear to me whether the identifiers I may wish to use
would still be prohibited under PEP 3131, I vote:

     +0

__________________________________________
[1] RPL (HP's Dr. William Wickes' language and environment circa the
1980s) allows for a few specific "non-ASCII" glyphs as the start of a
name. I have solved my problem in my Python "appliance computer"
project by having up to three representations for each name: a Python
2.x-acceptable name as the actual Python identifier, a Unicode text
display exposed to the end user, and, if needed, a bitmap display
exposed to the end user. So -- IAGNI. :-)

-- 
Richard Hanson