[Python-ideas] π = math.pi

Sat Jun 3 14:50:28 EDT 2017

On 03/06/17 18:48, Steven D'Aprano wrote:
> On Sun, Jun 04, 2017 at 02:36:50AM +1000, Steven D'Aprano wrote:
>
>> But Python 3.5 does treat it as an identifier!
>>
>> py> ℘ = 1  # should be a SyntaxError ?
>> py> ℘
>> 1
>>
>> There's a bug here, somewhere, I'm just not sure where...
> That appears to be the only Symbol Math character which is accepted as 
> an identifier in Python 3.5:
>
> py> import unicodedata
> py> all_unicode = map(chr, range(0x110000))
> py> symbols = [c for c in all_unicode if unicodedata.category(c) == 'Sm']
> py> len(symbols)
> 948
> py> ns = {}
> py> for c in symbols:
> ...     try:
> ...             exec(c + " = 1", ns)
> ...     except SyntaxError:
> ...             pass
> ...     else:
> ...             print(c, unicodedata.name(c))
> ...
> ℘ SCRIPT CAPITAL P
> py>

This is actually not a bug in Python, but a quirk in Unicode.

I've had a closer look at PEP 3131 [1], which specifies that Python
identifiers follow the Unicode classes XID_Start and XID_Continue. ℘ is
listed in the standard [2][3] as XID_Start, so Python correctly accepts
it as an identifier.

>>> import unicodedata
>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     # neither letter nor letter-number
...     if not category.startswith('L') and category != 'Nl':
...         if c.isidentifier():
...             print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
...
_    U+005F    LOW LINE
℘    U+2118    SCRIPT CAPITAL P
℮    U+212E    ESTIMATED SYMBOL
>>>

℘ and ℮ are actually explicitly mentioned in the Unicode annex [3]:

>
>       2.5 Backward Compatibility
>
> Unicode General_Category values are kept as stable as possible, but
> they can change across versions of the Unicode Standard. The bulk of
> the characters having a given value are determined by other
> properties, and the coverage expands in the future according to the
> assignment of those properties. In addition, the Other_ID_Start
> property provides a small list of characters that qualified as
> ID_Start characters in some previous version of Unicode solely on the
> basis of their General_Category properties, but that no longer qualify
> in the current version. These are called /grandfathered/ characters.
>
> The Other_ID_Start property includes characters such as the following:
>
>     U+2118 ( ℘ ) SCRIPT CAPITAL P
>     U+212E ( ℮ ) ESTIMATED SYMBOL
>     U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK
>     U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
>
>
I have no idea why U+309B and U+309C are not accepted as identifiers by
Python 3.5. This could be a question of Python following an old version
of the Unicode standard, or it *could* be a bug.
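
One way to dig further, as a rough sketch rather than a diagnosis: PEP 3131
also says identifiers are converted to NFKC while parsing, so it may be worth
comparing str.isidentifier() with each character's NFKC form. The names
`candidates`, `describe` and `report` are made up for this illustration.

```python
import unicodedata

# The four characters the annex lists as examples of Other_ID_Start.
candidates = ['\u2118', '\u212e', '\u309b', '\u309c']

def describe(c):
    """Return (name, is_identifier, NFKC form as code points) for one character."""
    nfkc = unicodedata.normalize('NFKC', c)
    return (
        unicodedata.name(c),
        c.isidentifier(),
        ['U+%04X' % ord(x) for x in nfkc],
    )

# Map each candidate character to its description.
report = {c: describe(c) for c in candidates}

for c, (name, ok, nfkc) in report.items():
    print('U+%04X\t%s\tisidentifier=%s\tNFKC=%s'
          % (ord(c), name, ok, ' '.join(nfkc)))
```

If the two sound marks normalize to a sequence that starts with a space,
that would be consistent with them being left out of XID_Start while still
being listed under Other_ID_Start. I have not checked this against the
Unicode data files themselves.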

Thomas

[1] https://www.python.org/dev/peps/pep-3131/#specification-of-language-changes
[2] http://www.unicode.org/Public/4.1.0/ucd/DerivedCoreProperties.txt
[3] http://www.unicode.org/reports/tr31/