[Python-ideas] π = math.pi

Thomas Jollans tjol at tjol.eu
Sat Jun 3 15:02:37 EDT 2017


On 03/06/17 20:41, Chris Angelico wrote:
> [snip]
> For reference, as well as the 948 Sm, there are 1690 Mn and 5777 So,
> but only these characters are valid from them:
>
> \u1885 Mn MONGOLIAN LETTER ALI GALI BALUDA
> \u1886 Mn MONGOLIAN LETTER ALI GALI THREE BALUDA
> ℘ Sm SCRIPT CAPITAL P
> ℮ So ESTIMATED SYMBOL
>
> 2118 SCRIPT CAPITAL P and 212E ESTIMATED SYMBOL are listed in
> PropList.txt as Other_ID_Start, so they make sense. But that doesn't
> explain the two characters from category Mn. It also doesn't explain
> why U+309B and U+309C are *not* valid, despite being declared
> Other_ID_Start. Maybe it's a bug? Maybe 309B and 309C somehow got
> switched into 1885 and 1886??

\u1885 and \u1886 are categorised as letters (category Lo) by my Python
3.5. (Which makes sense, right?) If your system puts them in category
Mn, that's bound to be a bug somewhere.

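As for \u309B and \u309C - it turns out this is a question of
normalisation. PEP 3131 requires NFKC normalisation, and both characters
normalise to a SPACE followed by a combining mark, so neither can start
an identifier: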

>>> import unicodedata
>>> for c in unicodedata.normalize('NFKC', '\u309B'):
...     print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
...
     U+0020    SPACE
    U+3099    COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
>>> for c in unicodedata.normalize('NFKC', '\u309C'):
...     print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
...
     U+0020    SPACE
    U+309A    COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
>>>

This is... interesting.
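
For anyone who wants to repeat the comparison on their own interpreter, here is a rough sketch that prints the category, the NFKC form and the identifier status of each character discussed above. The helper name describe() is made up for illustration, and I'm assuming str.isidentifier() is a fair proxy for what the compiler accepts. The categories reported depend on the Unicode database bundled with your Python, which is why the last line prints unicodedata.unidata_version.

```python
import unicodedata


def describe(ch):
    """Return (category, NFKC code points, isidentifier) for one character.

    Hypothetical helper, not part of any library.
    """
    normalized = unicodedata.normalize('NFKC', ch)
    return (
        unicodedata.category(ch),
        ['U+%04X' % ord(c) for c in normalized],
        # Assumption: isidentifier() mirrors what the compiler accepts.
        ch.isidentifier(),
    )


# The six characters mentioned in this thread.
for ch in '\u1885\u1886\u2118\u212E\u309B\u309C':
    category, nfkc, valid = describe(ch)
    print('U+%04X\t%s\t%s\t%s' % (ord(ch), category, ' '.join(nfkc), valid))

# The category results vary with the Unicode version, so report it too.
print('Unicode database version:', unicodedata.unidata_version)
```

Comparing the output between two Python versions should show whether the Lo/Mn difference comes from the Unicode data rather than from the system.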


Thomas



