[New-bugs-announce] [issue26843] tokenize does not include Other_ID_Start or Other_ID_Continue in identifier

Joshua Landau report at bugs.python.org
Sun Apr 24 21:58:44 EDT 2016


New submission from Joshua Landau:

This is effectively a continuation of https://bugs.python.org/issue9712.

The line in Lib/tokenize.py

    Name = r'\w+'

needs to be changed to a regular expression that also accepts Other_ID_Start characters at the start and Other_ID_Continue characters elsewhere. As it stands, tokenize rejects valid identifiers such as '℘·'.
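A minimal reproduction of the mismatch (relying on str.isidentifier(), which follows the language reference):

```python
import re

# '℘' (U+2118) is Other_ID_Start and '·' (U+00B7) is Other_ID_Continue,
# so together they form a legal identifier per the language reference.
ident = '\u2118\u00b7'  # '℘·'

# str.isidentifier() agrees with the reference...
assert ident.isidentifier()

# ...but tokenize's Name pattern r'\w+' matches neither character,
# since both fall outside the \w word-character categories.
assert re.fullmatch(r'\w+', ident) is None
```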


See the reference here:

    https://docs.python.org/3.5/reference/lexical_analysis.html#identifiers
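One possible shape for the fix (a sketch, not a tested patch; the code point lists below are copied from Unicode's PropList and would need to be kept in sync with the Unicode version CPython ships):

```python
import re

# Explicit classes for the handful of code points \w does not cover.
# Other_ID_Start: U+1885, U+1886, U+2118, U+212E, U+309B, U+309C
# Other_ID_Continue: U+00B7, U+0387, U+1369..U+1371, U+19DA
OTHER_ID_START = '\u1885\u1886\u2118\u212e\u309b\u309c'
OTHER_ID_CONTINUE = '\u00b7\u0387\u1369-\u1371\u19da'

# Keeping \w at the start mirrors the existing pattern, which already
# admits leading digits; tokenize relies on Number being tried first.
Name = re.compile('[\\w%s][\\w%s%s]*' %
                  (OTHER_ID_START, OTHER_ID_START, OTHER_ID_CONTINUE))

assert Name.fullmatch('\u2118\u00b7')   # '℘·' now matches
assert Name.fullmatch('ordinary_name')  # plain identifiers still work
```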

I'm unsure whether Unicode normalization (i.e. the `xid_start`/`xid_continue` properties) needs to be dealt with as well.
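For context on the normalization question: the compiler NFKC-normalizes identifiers per PEP 3131, whereas tokenize does not normalize at all, so this may be out of scope for this issue. A small illustration:

```python
import unicodedata

# 'ﬁ' (U+FB01, LATIN SMALL LIGATURE FI) NFKC-normalizes to 'fi',
# which is why 'ﬁ' and 'fi' name the same variable in the compiler.
assert unicodedata.normalize('NFKC', '\ufb01') == 'fi'
```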


Credit to toriningen from http://stackoverflow.com/a/29586366/1763356.

----------
components: Library (Lib)
messages: 264145
nosy: Joshua.Landau
priority: normal
severity: normal
status: open
title: tokenize does not include Other_ID_Start or Other_ID_Continue in identifier
type: behavior
versions: Python 3.5

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26843>
_______________________________________