[Python-ideas] π = math.pi

Sat Jun 3 15:55:05 EDT 2017

On 2017-06-03 19:50, Thomas Jollans wrote:
> On 03/06/17 18:48, Steven D'Aprano wrote:
>> On Sun, Jun 04, 2017 at 02:36:50AM +1000, Steven D'Aprano wrote:
>>
>>> But Python 3.5 does treat it as an identifier!
>>>
>>> py> ℘ = 1  # should be a SyntaxError ?
>>> py> ℘
>>> 1
>>>
>>> There's a bug here, somewhere, I'm just not sure where...
>> That appears to be the only Symbol Math character which is accepted as 
>> an identifier in Python 3.5:
>>
>> py> import unicodedata
>> py> all_unicode = map(chr, range(0x110000))
>> py> symbols = [c for c in all_unicode if unicodedata.category(c) == 'Sm']
>> py> len(symbols)
>> 948
>> py> ns = {}
>> py> for c in symbols:
>> ...     try:
>> ...             exec(c + " = 1", ns)
>> ...     except SyntaxError:
>> ...             pass
>> ...     else:
>> ...             print(c, unicodedata.name(c))
>> ...
>> ℘ SCRIPT CAPITAL P
>> py>
> 
> This is actually not a bug in Python, but a quirk in Unicode.
> 
> I've had a closer look at PEP 3131 [1], which specifies that Python
> identifiers follow the Unicode classes XID_Start and XID_Continue. ℘ is
> listed in the standard [2][3] as XID_Start, so Python correctly accepts
> it as an identifier.
> 
> >>> import unicodedata
> >>> all_unicode = map(chr, range(0x110000))
> >>> for c in all_unicode:
> ...     category = unicodedata.category(c)
> ...     # neither letter nor letter-number
> ...     if not category.startswith('L') and category != 'Nl':
> ...         if c.isidentifier():
> ...             print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
> ...
> _    U+005F    LOW LINE
> ℘    U+2118    SCRIPT CAPITAL P
> ℮    U+212E    ESTIMATED SYMBOL
> >>>
> 
> ℘ and ℮ are actually explicitly mentioned in the Unicode annex [3]:
> 
>>
>>       2.5 Backward Compatibility
>>
>> Unicode General_Category values are kept as stable as possible, but
>> they can change across versions of the Unicode Standard. The bulk of
>> the characters having a given value are determined by other
>> properties, and the coverage expands in the future according to the
>> assignment of those properties. In addition, the Other_ID_Start
>> property provides a small list of characters that qualified as
>> ID_Start characters in some previous version of Unicode solely on the
>> basis of their General_Category properties, but that no longer qualify
>> in the current version. These are called /grandfathered/ characters.
>>
>> The Other_ID_Start property includes characters such as the following:
>>
>>     U+2118 ( ℘ ) SCRIPT CAPITAL P
>>     U+212E ( ℮ ) ESTIMATED SYMBOL
>>     U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK
>>     U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
>>
>>
> I have no idea why U+309B and U+309C are not accepted as identifiers by
> Python 3.5. This could be a question of Python following an old version
> of the Unicode standard, or it *could* be a bug.
> 
[snip]

U+309B and U+309C have had the property ID_Start since at least Unicode 
6.0 (August 2010).

Interestingly, '_' doesn't have that property, although Python does 
allow identifiers to start with it.
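
For anyone who wants to check this on their own interpreter, here is a
minimal sketch. The names CANDIDATES and identifier_report are mine and
purely illustrative. It reports, for each character mentioned in this
thread, whether str.isidentifier() accepts it, along with its general
category and its NFKC form. My guess, which I have not verified against
the annex, is that the NFKC form matters here: PEP 3131 is specified in
terms of the XID_ properties rather than the plain ID_ ones, and the two
can differ for characters that change under normalization.

```python
import unicodedata

# Characters discussed in this thread (illustrative selection).
CANDIDATES = ['_', '\u2118', '\u212E', '\u309B', '\u309C']


def identifier_report(chars):
    """Map each character to (is_identifier, general_category, nfkc_form)."""
    report = {}
    for c in chars:
        report[c] = (
            c.isidentifier(),                  # what the interpreter accepts
            unicodedata.category(c),           # e.g. 'Sm', 'So', 'Pc'
            unicodedata.normalize('NFKC', c),  # form after NFKC normalization
        )
    return report


if __name__ == '__main__':
    for c, (ok, cat, nfkc) in identifier_report(CANDIDATES).items():
        print('U+%04X\t%s\t%s\tNFKC=%r\t%s'
              % (ord(c), cat, ok, nfkc, unicodedata.name(c)))
```

The results depend on the Unicode database bundled with the interpreter,
so they may differ between Python versions.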