[issue24194] tokenize fails on some Other_ID_Start or Other_ID_Continue
Terry J. Reedy
report at bugs.python.org
Sun Jan 16 18:10:56 EST 2022
Terry J. Reedy <tjreedy at udel.edu> added the comment:
Updated doc link, which appears to be the same:
https://docs.python.org/3.11/reference/lexical_analysis.html#identifiers
Updated property list linked from the above:
https://www.unicode.org/Public/14.0.0/ucd/PropList.txt
Relevant content for this issue:
1885..1886 ; Other_ID_Start # Mn [2] MONGOLIAN LETTER ALI GALI BALUDA..MONGOLIAN LETTER ALI GALI THREE BALUDA
2118 ; Other_ID_Start # Sm SCRIPT CAPITAL P
212E ; Other_ID_Start # So ESTIMATED SYMBOL
309B..309C ; Other_ID_Start # Sk [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
# Total code points: 6
00B7 ; Other_ID_Continue # Po MIDDLE DOT
0387 ; Other_ID_Continue # Po GREEK ANO TELEIA
1369..1371 ; Other_ID_Continue # No [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA ; Other_ID_Continue # No NEW TAI LUE THAM DIGIT ONE
# Total code points: 12
Codepoints of '℘·' opening example:
'0x2118' Other_ID_Start Sm SCRIPT CAPITAL P
'0xb7' Other_ID_Continue Po MIDDLE DOT
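The two lines above can be reproduced with the stdlib. This is only a small sketch; the helper name describe is made up for illustration, and the Other_ID_* property itself is not exposed by unicodedata, so only the code point, general category and name are shown.

```python
import unicodedata


def describe(ch):
    """Return (hex code point, general category, Unicode name) for one character.

    Hypothetical helper for illustration only; not part of any patch.
    """
    return (hex(ord(ch)), unicodedata.category(ch), unicodedata.name(ch))


example = "\u2118\u00b7"  # the opening example: SCRIPT CAPITAL P + MIDDLE DOT

rows = [describe(ch) for ch in example]
for row in rows:
    print(*row)

# The str method agrees with the compiler that this is a valid identifier.
is_identifier = example.isidentifier()
print(is_identifier)
```

Neither category (Sm, Po) is one that \w matches, which is why a \w-based name pattern rejects the pair even though str.isidentifier() accepts it.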
Except for the two Mongolian start characters, Meador's patch hardcodes the 'Other' characters, thereby adding them without waiting for re to be fixed. While this will miss new additions without manual updates, it is better than missing everything for however many years. I will make a PR with the additions and look at the new tests.
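To illustrate what hardcoding means here, a rough sketch follows. It is not Meador's patch and not the tokenize source; the names OTHER_ID_START, OTHER_ID_CONTINUE and NAME are assumptions for this example. The character lists are copied from the PropList.txt excerpt above, leaving out the two Mongolian start characters.

```python
import re

# Hardcoded from the PropList.txt excerpt (Unicode 14.0.0).
# Assumption: these names and this pattern are illustrative only.
OTHER_ID_START = r"\u2118\u212E\u309B\u309C"  # 1885..1886 deliberately omitted
OTHER_ID_CONTINUE = r"\u00B7\u0387\u1369-\u1371\u19DA"

# First character: a \w character that is not a digit, or an Other_ID_Start.
# Following characters: any \w character, or an Other_ID_Continue.
NAME = (
    r"(?:[^\W\d]|[" + OTHER_ID_START + r"])"
    r"(?:\w|[" + OTHER_ID_CONTINUE + r"])*"
)

example = "\u2118\u00b7"  # the opening example

plain = re.fullmatch(r"\w+", example)    # a \w-only name pattern
patched = re.fullmatch(NAME, example)    # pattern with hardcoded extras
dot_first = re.fullmatch(NAME, "\u00b7a")  # MIDDLE DOT may not start a name

print(plain, patched, dot_first)
```

The trade-off is the one described above: the lists have to be updated by hand whenever a new Unicode version adds Other_ID_Start or Other_ID_Continue code points.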
----------
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue24194>
_______________________________________