[issue24194] tokenize fails on some Other_ID_Start or Other_ID_Continue

Terry J. Reedy report at bugs.python.org
Wed Mar 14 20:55:40 EDT 2018


Terry J. Reedy <tjreedy at udel.edu> added the comment:

I closed #1693050 as a duplicate of #12731 (the \w issue).  I left #9712 closed, closed #32987, and marked both as duplicates of this issue.

In msg313814 of the latter, Serhiy indicates which start and continue identifier characters are currently matched by \W for re and regex.  He gives a fix there that he says requires the \w issue to be fixed first; it is similar to the posted patch.  He says that without \w fixed, another 2000+ characters would need to be added.  Perhaps the v0 patch needs more tests (I don't know).
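
For the record, here is a minimal sketch of the mismatch being discussed (my own illustration, not taken from the patch or from msg313814), assuming a Python version still affected by this issue:

    import io
    import re
    import tokenize

    ch = "\u2118"                # SCRIPT CAPITAL P, an Other_ID_Start character
    print(ch.isidentifier())     # True: the compiler accepts it in a name
    print(re.match(r"\w", ch))   # None: re's \w does not cover it

    # On affected versions the character comes back as ERRORTOKEN, not NAME.
    for tok in tokenize.generate_tokens(io.StringIO(ch + " = 1\n").readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))

str.isidentifier() follows the language definition, which pulls in Other_ID_Start, while tokenize's Name pattern is built on re's \w, which does not.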

He also says that re support for properties (#12734) would make things even better.
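
As an illustration of what property support would buy (this sketch uses the third-party regex module's \p{...} classes, not anything available in re today):

    import regex  # third-party module, not re

    # Python's identifier rule, expressed with property classes: a leading
    # underscore or XID_Start character, then any number of XID_Continue.
    name = regex.compile(r"[_\p{XID_Start}]\p{XID_Continue}*")
    print(bool(name.fullmatch("\u2118x")))   # True
    print(bool(name.fullmatch("1abc")))      # False: cannot start with a digit

Modulo Unicode version differences between the regex module and the running Python, such a pattern matches what str.isidentifier() accepts, with no hand-maintained list of Other_ID_Start or Other_ID_Continue code points.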

Three of the characters in the patch are too obscure for Firefox on Windows and print as boxes.  I do not recognize some others, and I could not type any of them.  I thought we had a policy of using \u or \U escapes even in tests to avoid such problems.  (I notice that there are already non-ASCII characters in the context.)
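
For example, the problem characters could be spelled with escapes plus name comments, something like this (my sketch; the code points are from the Unicode Other_ID_Start and Other_ID_Continue lists):

    # \u escapes keep the test file ASCII-only, so it stays readable even in
    # fonts that render these characters as boxes.
    OTHER_ID_START_SAMPLES = (
        "\u1885",  # MONGOLIAN LETTER ALI GALI BALUDA
        "\u2118",  # SCRIPT CAPITAL P
        "\u212E",  # ESTIMATED SYMBOL
    )
    OTHER_ID_CONTINUE_SAMPLES = (
        "\u00B7",  # MIDDLE DOT
        "\u0387",  # GREEK ANO TELEIA
        "\u19DA",  # NEW TAI LUE THAM DIGIT ONE
    )

    for ch in OTHER_ID_START_SAMPLES:
        print(hex(ord(ch)), ch.isidentifier())          # usable as a first character
    for ch in OTHER_ID_CONTINUE_SAMPLES:
        print(hex(ord(ch)), ("x" + ch).isidentifier())  # usable after the first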

----------
nosy: +terry.reedy
title: tokenize yield an ERRORTOKEN if an identifier uses Other_ID_Start or Other_ID_Continue -> tokenize fails on some Other_ID_Start or Other_ID_Continue
versions: +Python 3.7, Python 3.8 -Python 3.5

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue24194>
_______________________________________

