unicode as valid naming symbols

Tue Apr 1 09:33:33 EDT 2014

On 4/1/14 9:00 AM, Chris Angelico wrote:
> On Tue, Apr 1, 2014 at 10:59 PM, Antoon Pardon
> <antoon.pardon at rece.vub.ac.be> wrote:
>> On 01-04-14 12:58, Chris Angelico wrote:
>>> But because, in the future, Python may choose to create new operators,
>>> the simplest and safest way to ensure safety is to put a boundary on
>>> what can be operators and what can be names; Unicode character classes
>>> are perfect for this. It's also possible that all Unicode whitespace
>>> characters might become legal for indentation and separation (maybe
>>> they are already??), so obviously they're ruled out as identifiers;
>>> anyway, I honestly do not think people would want to use U+2007 FIGURE
>>> SPACE inside a name. So if we deny whitespace, and accept letters and
>>> digits, it makes good sense to deny mathematical symbols so as to keep
>>> them available for operators. (It also makes reasonable sense to
>>> *permit* mathematical symbols, thus allowing you to use them for
>>> functions/methods, in the same way that you can use "n", "o", and "t",
>>> but not "not"; but with word operators, the entire word has to be used
>>> as-is before it's a collision - with a symbolic one, any instance of
>>> that symbol inside a name will change parsing entirely. It's a
>>> trade-off, and Python's made a decision one way and not the other.)
>>
>> This mostly makes sense to me. The only caveat I have is that since we
>> also allow _ (U+005F LOW LINE) in names, which belongs to the category
>> <punctuation, connector>, we should allow other symbols within this
>> category in a name.
>>
>> But I confess that is mostly personal taste, since I find names_like_this
>> ugly. Names-like-this look better to me but that wouldn't be workable
>> in Python. But maybe there is some connector that would be aesthetically
>> pleasing and not cause other problems.
>
> That's reasonable. The Pc category doesn't have much in it:
>
> http://www.fileformat.info/info/unicode/category/Pc/list.htm
>
> If the definition of "characters permitted in identifiers" is derived
> exclusively from the Unicode categories, including Pc would make fine
> sense. Probably the definition should be: First character is L* or Pc,
> subsequent characters are L*, N*, or Pc, and either Mn or M*
> (combining characters). Or something like that.

Maybe I'm misunderstanding the discussion... It seems like we're talking 
about a hypothetical definition of identifiers based on Unicode 
character categories, but there's no need: Python 3 has defined 
precisely that.  From the docs 
(https://docs.python.org/3/reference/lexical_analysis.html#identifiers):

---<snip>---------

Python 3.0 introduces additional characters from outside the ASCII range 
(see PEP 3131). For these characters, the classification uses the 
version of the Unicode Character Database as included in the unicodedata 
module.

Identifiers are unlimited in length. Case is significant.

identifier   ::=  xid_start xid_continue*
id_start     ::=  <all characters in general categories Lu, Ll, Lt, Lm,
                  Lo, Nl, the underscore, and characters with the
                  Other_ID_Start property>
id_continue  ::=  <all characters in id_start, plus characters in the
                  categories Mn, Mc, Nd, Pc and others with the
                  Other_ID_Continue property>
xid_start    ::=  <all characters in id_start whose NFKC normalization
                  is in "id_start xid_continue*">
xid_continue ::=  <all characters in id_continue whose NFKC
                  normalization is in "id_continue*">

The Unicode category codes mentioned above stand for:

     Lu - uppercase letters
     Ll - lowercase letters
     Lt - titlecase letters
     Lm - modifier letters
     Lo - other letters
     Nl - letter numbers
     Mn - nonspacing marks
     Mc - spacing combining marks
     Nd - decimal numbers
     Pc - connector punctuations
     Other_ID_Start - explicit list of characters in PropList.txt to
                      support backwards compatibility
     Other_ID_Continue - likewise

All identifiers are converted into the normal form NFKC while parsing; 
comparison of identifiers is based on NFKC.

---<end snip>-----
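
To see how those rules play out for the characters mentioned in this thread, here is a small sketch using only str.isidentifier() and the unicodedata module. The helper name describe() and the sample names are mine, made up for illustration; the exact results depend on the Unicode database shipped with your Python 3.

```python
import unicodedata

def describe(name):
    """Return (is_identifier, [general category of each character])."""
    return name.isidentifier(), [unicodedata.category(ch) for ch in name]

# Hypothetical sample names, chosen to exercise the categories discussed above.
candidates = [
    "names_like_this",   # U+005F LOW LINE, category Pc
    "names\u203flike",   # U+203F UNDERTIE, also category Pc
    "\u203fleading",     # a Pc character other than "_" in first position
    "a\u2211b",          # U+2211 N-ARY SUMMATION, category Sm
    "a\u2007b",          # U+2007 FIGURE SPACE, category Zs
]

for name in candidates:
    ok, cats = describe(name)
    print(ascii(name), ok, cats)

# Identifiers are converted to NFKC while parsing, so a name written with
# U+FB01 LATIN SMALL LIGATURE FI is stored under its normalized spelling.
ligature_name = "\ufb01le"
namespace = {}
exec(ligature_name + " = 1", namespace)
normalized = unicodedata.normalize("NFKC", ligature_name)
print(ascii(ligature_name), "->", normalized, normalized in namespace)
```

So the other Pc connectors are already allowed after the first character, while mathematical symbols and whitespace are rejected, which matches the grammar quoted above.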

>
> ChrisA
>

-- 
Ned Batchelder, http://nedbatchelder.com