PEP 3131: Supporting Non-ASCII Identifiers

Fri May 18 00:28:03 EDT 2007

> Currently, in Python 2.5, identifiers are specified as starting with
> an upper- or lowercase letter or underscore ('_') with the following
> "characters" of the identifier also optionally being a numerical digit
> ("0"..."9").
> 
> This current state seems easy to remember even if felt restrictive by
> many.
> 
> Contrawise, the referenced document "UAX-31" is a bit obscure to me

It's actually very easy. The basic principle will stay: the first
character must be a letter or an underscore, followed by letters,
underscores, and digits.

The real questions are: what is a letter? What is an underscore?
What is a digit?

> 1) Will this allow me to use, say, a "right-arrow" glyph (if I can
> find one) to start my identifier? 

No. A right-arrow (such as U+2192, RIGHTWARDS ARROW) is a symbol
(general category Sm: Symbol, Math). See

http://unicode.org/Public/UNIDATA/UCD.html

for a list of general category values, and

http://unicode.org/Public/UNIDATA/UnicodeData.txt

for a textual description of all characters.
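
To look up the general category of a character yourself, the
unicodedata module is enough. The following is only a minimal sketch;
the sample characters and variable names are my own choice, and the
str.isidentifier() check assumes an interpreter that implements this
PEP (Python 3):

```python
import unicodedata

# A few sample characters, picked purely for illustration.
samples = {
    "letter": "a",
    "underscore": "_",
    "digit": "7",
    "right arrow": "\u2192",  # RIGHTWARDS ARROW
}

# Map each label to the two-letter general category of its character.
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}

for label, cat in categories.items():
    print(label, cat)

# The arrow is a math symbol (Sm), so it cannot start an identifier.
arrow_ok = "\u2192x".isidentifier()
print("arrow can start an identifier:", arrow_ok)
```

Here the letter reports Ll, the underscore Pc, the digit Nd, and the
arrow Sm.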

Now, there is a special case in that Unicode supports "combining
modifier characters", i.e. characters that are not characters
themselves, but modify previous characters, to add diacritical
marks to letters. Unicode has great flexibility in applying these,
to form characters that are not supported themselves. Among those,
there is U+20D7, COMBINING RIGHT ARROW ABOVE, which is of general
category Mn, Mark, Nonspacing.

In PEP 3131, such marks may not appear as the first character
(since they need to modify a base character), but they may appear
as subsequent characters. This allows you to form identifiers such as
v⃗ (which should render as a small letter v with a vector
arrow on top).
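
A short sketch of that rule, again assuming a Python 3 interpreter
where str.isidentifier() is available (the constant and variable names
are made up for this example):

```python
import unicodedata

ARROW_ABOVE = "\u20d7"  # COMBINING RIGHT ARROW ABOVE

# The combining arrow is a nonspacing mark, not a letter.
mark_name = unicodedata.name(ARROW_ABOVE)
mark_category = unicodedata.category(ARROW_ABOVE)

# Mark after a base letter: allowed as a continuation character.
vec_ok = ("v" + ARROW_ABOVE).isidentifier()

# Mark in first position: rejected, there is no base character to modify.
mark_first_ok = (ARROW_ABOVE + "v").isidentifier()

print(mark_name, mark_category, vec_ok, mark_first_ok)
```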

> 2) Could an ``ID_Continue`` be used as an ``ID_Start`` if using a RTL
> (reversed or "mirrored") identifier? (Probably not, but I don't know.)

Unicode, and this PEP, always use logical order, not rendering order.
What matters is in what order the characters appear in the source code
string.
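
To illustrate logical order, here is a small sketch with two Hebrew
letters and a digit. The example strings are my own, and it assumes
Python 3 for str.isidentifier(). Index 0 of the string is whatever
comes first in memory, however an editor chooses to display it:

```python
import unicodedata

ALEF = "\u05d0"  # HEBREW LETTER ALEF
BET = "\u05d1"   # HEBREW LETTER BET

# ALEF is a right-to-left letter of category Lo (Letter, other).
alef_bidi = unicodedata.bidirectional(ALEF)
alef_category = unicodedata.category(ALEF)

# Logical order: letters first, then the digit.
letters_then_digit = ALEF + BET + "1"
# Logical order: digit first, then the letters.
digit_then_letters = "1" + ALEF + BET

first_char_is_alef = letters_then_digit[0] == ALEF
ok_letters_first = letters_then_digit.isidentifier()
ok_digit_first = digit_then_letters.isidentifier()

print(alef_bidi, alef_category, ok_letters_first, ok_digit_first)
```

Only the stored order decides which character counts as the start.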

RTL languages do pose a challenge, in particular since bidirectional
algorithms apparently aren't implemented correctly in many editors.

> 3) Is or will there be a definitive and exhaustive listing (with
> bitmap representations of the glyphs to avoid the font issues) of the
> glyphs that the PEP 3131 would allow in identifiers? (Does this
> question even make sense?)

It makes sense, but it is difficult to implement. The PEP already
links to a non-normative list that is exhaustive for Unicode 4.1.
Future Unicode versions may add additional characters, so
a list that is exhaustive now might not be in the future. The
Unicode consortium promises stability, meaning that what is an
identifier now won't be reclassified as a non-identifier in the
future, but the reverse is not true, as new code points get
assigned.

As for the list I generated in HTML: It might be possible to
make it include bitmaps instead of HTML character references,
but doing so is a licensing problem, as you need a license
for a font that has all these characters. If you want to
look up a specific character, I recommend going to the Unicode
code charts, at

http://www.unicode.org/charts/

Notice that an HTML page that includes individual bitmaps
for all characters would take *ages* to load.

Regards,
Martin

P.S. For anybody who wants to play with generating visualisations
of the PEP, here are the functions I used:

import unicodedata

def isnorm(c):
    # True if c is unchanged by NFC normalization
    return unicodedata.normalize("NFC", c) == c

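def start(c):
    # True if c may appear as the first character of an identifier
    if not isnorm(c):
        return False
    # letters (including uppercase, Lu) and letter numbers
    if unicodedata.category(c) in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl'):
        return True
    if c==u'_':
        return True
    # explicitly listed additional start characters
    if c in u"\u2118\u212E\u309B\u309C":
        return True
    return False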

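def cont_only(c):
    # True if c may appear only after the first character
    if not isnorm(c):
        return False
    # combining marks, decimal digits, connector punctuation
    if unicodedata.category(c) in ('Mn', 'Mc', 'Nd', 'Pc'):
        return True
    # explicitly listed additional continuation characters
    if 0x1369 <= ord(c) <= 0x1371:
        return True
    return False

def cont(c):
    # True if c may appear anywhere after the first character
    return start(c) or cont_only(c)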

The isnorm() check excludes from the list those characters which
change under NFC. These are a few compatibility characters
which are allowed in source code, but are semantically
indistinguishable from their canonical forms.
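
As a concrete case, ANGSTROM SIGN (U+212B) normalizes to the ordinary
letter U+00C5 under NFC, and OHM SIGN (U+2126) normalizes to GREEK
CAPITAL LETTER OMEGA (U+03A9). A minimal sketch follows; the helper
name changes_under_nfc is made up for this example and is not one of
the functions I used:

```python
import unicodedata

def changes_under_nfc(c):
    # Hypothetical helper: True if NFC normalization rewrites c.
    return unicodedata.normalize("NFC", c) != c

ANGSTROM_SIGN = "\u212b"
A_WITH_RING = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
OHM_SIGN = "\u2126"
OMEGA = "\u03a9"         # GREEK CAPITAL LETTER OMEGA

angstrom_nfc = unicodedata.normalize("NFC", ANGSTROM_SIGN)
ohm_nfc = unicodedata.normalize("NFC", OHM_SIGN)

# The signs change under NFC; the letters they map to are already stable.
for ch in (ANGSTROM_SIGN, A_WITH_RING, OHM_SIGN, OMEGA):
    print("U+%04X" % ord(ch), changes_under_nfc(ch))
```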