Well, I finally ran into a Python Unicode problem, sort of

Chris Angelico rosuav at gmail.com
Sun Jul 3 03:41:59 EDT 2016


On Sun, Jul 3, 2016 at 4:58 PM, John Ladasky <john_ladasky at sbcglobal.net> wrote:
> Up until today, every character I've tried has been accepted by the Python interpreter as a legitimate character for inclusion in a variable name.  Now I'm copying a formula which defines a gradient.  The nabla symbol (∇) is used in the naming of gradients.  Python isn't having it.  The interpreter throws a "SyntaxError: invalid character in identifier" when it encounters the ∇.
>
> I am now wondering what constitutes a valid character for an identifier, and how they were chosen.  Obviously, the Western alphabet and standard Greek letters work.  I just tried a few very weird characters from the Latin Extended range, and some Cyrillic characters.  These are also fine.
>

Very good question! The detailed answer is here:

https://docs.python.org/3/reference/lexical_analysis.html#identifiers

> A philosophical question.  Why should any character be excluded from a variable name, besides the fact that it might also be an operator?
>

In a way, that's exactly what's happening here. Python permits certain
categories of character in identifiers, leaving other categories
available for operators. Even though there aren't any non-ASCII
operators in vanilla CPython, it's plausible that someone could
create a Python-based language with more operators (e.g. ≠ NOT EQUAL TO
as an alternative to !=), and I'm sure you'd agree that saying "≠ = 1"
is nonsensical.
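
A quick way to see which side of that line a character falls on is
str.isidentifier(), which checks a string against the language's
definition of an identifier. This is only a sketch; the candidate
names below are picked for illustration and are not from the original
question, apart from ∇ and ≠.

```python
# Illustrative candidates: Greek, Latin Extended and Cyrillic letters,
# plus the two symbols discussed above.
candidates = ["α", "ñ", "д", "∇", "≠", "x∇"]

# str.isidentifier() reports whether the whole string is a valid identifier.
results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {'valid' if ok else 'invalid'} identifier")
```

The letters pass and the symbols fail, even when a symbol follows an
ordinary letter as in "x∇".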

> This might be a problem I can solve, I'm not sure.  Is there a file that the Python interpreter refers to which defines the accepted variable name characters?  Perhaps I could just add ∇.
>

The key here is its Unicode category:

>>> import unicodedata
>>> unicodedata.category("∇")
'Sm'

You could probably hack CPython to include Sm, and maybe Sc, Sk, and
So, as valid identifier characters. I'm not sure where, though, and
I've just spent a good bit of time delving (it's based on the
XID_Start and XID_Continue derived properties, but I have no idea
where they're defined - Tools/unicode/makeunicodedata.py looks
promising, but even there, I can't find it). And - or maybe instead -
you could appeal to the core devs to have the category/ies in question
added to the official Python spec. Symbols like that are a bit of a
grey area, so you may find that you're starting a huge debate :)
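
To see how the categories line up with what the interpreter accepts,
something like the following may help. It is a rough sketch: the
START_CATEGORIES set is copied from the categories the language
reference lists for the first character of an identifier, the
describe() helper is a made-up name, and the sample characters are
chosen only to cover Sm, Sc, Sk and So.

```python
import unicodedata

# Categories the language reference lists for the first character of an
# identifier (underscore and Other_ID_Start characters are extra cases).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def describe(ch):
    """Return (category, category is a start category, ch.isidentifier())."""
    cat = unicodedata.category(ch)
    return cat, cat in START_CATEGORIES, ch.isidentifier()

# Sample characters: three letters, then one each from Sm, Sc, Sk and So.
samples = "aαд∇€^©"
table = {ch: describe(ch) for ch in samples}

for ch, (cat, allowed, ok) in table.items():
    print(f"U+{ord(ch):04X} {ch} {cat} start-category={allowed} isidentifier={ok}")
```

None of the four symbol categories is in the start set, which matches
the SyntaxError on ∇.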

Have fun.

ChrisA


