[Python-ideas] π = math.pi
Thomas Jollans
tjol at tjol.eu
Sat Jun 3 14:50:28 EDT 2017
On 03/06/17 18:48, Steven D'Aprano wrote:
> On Sun, Jun 04, 2017 at 02:36:50AM +1000, Steven D'Aprano wrote:
>
>> But Python 3.5 does treat it as an identifier!
>>
>> py> ℘ = 1 # should be a SyntaxError ?
>> py> ℘
>> 1
>>
>> There's a bug here, somewhere, I'm just not sure where...
> That appears to be the only Symbol Math character which is accepted as
> an identifier in Python 3.5:
>
> py> import unicodedata
> py> all_unicode = map(chr, range(0x110000))
> py> symbols = [c for c in all_unicode if unicodedata.category(c) == 'Sm']
> py> len(symbols)
> 948
> py> ns = {}
> py> for c in symbols:
> ...     try:
> ...         exec(c + " = 1", ns)
> ...     except SyntaxError:
> ...         pass
> ...     else:
> ...         print(c, unicodedata.name(c))
> ...
> ℘ SCRIPT CAPITAL P
> py>
This is actually not a bug in Python, but a quirk in Unicode.
I've had a closer look at PEP 3131 [1], which specifies that Python
identifiers follow the Unicode classes XID_Start and XID_Continue. ℘ is
listed in the standard [2][3] as XID_Start, so Python correctly accepts
it as an identifier.
>>> import unicodedata
>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     # neither letter nor letter-number
...     if not category.startswith('L') and category != 'Nl':
...         if c.isidentifier():
...             print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
...
_ U+005F LOW LINE
℘ U+2118 SCRIPT CAPITAL P
℮ U+212E ESTIMATED SYMBOL
>>>
℘ and ℮ are actually explicitly mentioned in the Unicode annex [3]:
>
> 2.5 Backward Compatibility
>
> Unicode General_Category values are kept as stable as possible, but
> they can change across versions of the Unicode Standard. The bulk of
> the characters having a given value are determined by other
> properties, and the coverage expands in the future according to the
> assignment of those properties. In addition, the Other_ID_Start
> property provides a small list of characters that qualified as
> ID_Start characters in some previous version of Unicode solely on the
> basis of their General_Category properties, but that no longer qualify
> in the current version. These are called /grandfathered/ characters.
>
> The Other_ID_Start property includes characters such as the following:
>
> U+2118 ( ℘ ) SCRIPT CAPITAL P
> U+212E ( ℮ ) ESTIMATED SYMBOL
> U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK
> U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
>
>
I have no idea why U+309B and U+309C are not accepted as identifiers by
Python 3.5. This could be a question of Python following an old version
of the Unicode standard, or it *could* be a bug.
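One way to narrow this down, as a rough sketch rather than a definitive
answer: UAX #31 derives XID_Start from ID_Start by dropping characters that
cause trouble under NFKC normalization, so it may be worth looking at how
the four characters quoted above normalize. The helper name `describe` is
made up for illustration.

```python
import unicodedata

# The four Other_ID_Start characters quoted from UAX #31 above.
CANDIDATES = ['\u2118', '\u212e', '\u309b', '\u309c']

def describe(c):
    """Return (is_identifier, NFKC code points) for one character.

    `describe` is a hypothetical helper, not part of any library.
    """
    nfkc = unicodedata.normalize('NFKC', c)
    return c.isidentifier(), ['U+%04X' % ord(x) for x in nfkc]

for c in CANDIDATES:
    ok, nfkc = describe(c)
    # code point, name, accepted as identifier?, NFKC form
    print('U+%04X\t%s\t%s\t%s' % (ord(c), unicodedata.name(c), ok, ' '.join(nfkc)))
```

If a character's NFKC form begins with something that cannot start an
identifier (a space, say), that would be consistent with it being listed as
ID_Start but left out of XID_Start, and so with Python rejecting it.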
Thomas
[1] https://www.python.org/dev/peps/pep-3131/#specification-of-language-changes
[2] http://www.unicode.org/Public/4.1.0/ucd/DerivedCoreProperties.txt
[3] http://www.unicode.org/reports/tr31/