[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

Sat Aug 31 12:30:04 EDT 2019

Bob Kline <bkline at rksystems.com> added the comment:

Ah, this is worse than I first thought. 2to3 is not just adding extra backslashes to regular expression strings, where at least the regular expression engine ends up doing what the original code asked the Python parser to do (unless user code checks for and enforces limits on the length of regular expression strings, in which case even that is broken). It is also mangling ordinary strings, where the extra backslashes change (that is, break) the behavior. For example, 2to3 wants to change

    if c not in ".-_:\u00B7\u0e87":

to

    if c not in ".-_:\\u00B7\\u0e87":

Not the same thing at all, as illustrated here:

$ python
Python 3.7.3 (default, Jun 19 2019, 07:38:49)
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> len("\u00B7")
1
>>> len("\\u00B7")
6
>>>

That breaks the original code. This is a serious bug.
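
A minimal sketch of both cases, assuming Python 3.3 or later (where the re module itself understands \uXXXX escapes). The variable names are illustrative only and do not come from the affected code:

```python
import re

# The string as originally written, and as 2to3 proposes to rewrite it.
original = ".-_:\u00B7\u0e87"
converted = ".-_:\\u00B7\\u0e87"

# Plain string: the parser resolves the escapes in the original, so it holds
# six characters; the converted string holds the literal backslash sequences.
print(len(original))    # 6
print(len(converted))   # 16

# The membership test from the example above therefore changes behavior.
print("\u00B7" in original)    # True
print("\u00B7" in converted)   # False
print("u" in original)         # False
print("u" in converted)        # True

# Regular expression: the doubled backslash still matches the same character,
# because the re engine interprets \u00B7 itself, but the pattern string is
# longer than the one the original code produced.
pattern_original = "[\u00B7]"
pattern_converted = "[\\u00B7]"
print(re.fullmatch(pattern_original, "\u00B7") is not None)    # True
print(re.fullmatch(pattern_converted, "\u00B7") is not None)   # True
print(len(pattern_original), len(pattern_converted))           # 3 8
```

The regular expression case keeps working only because the engine performs the escape processing that the parser no longer does; the plain membership test has no such second layer.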

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue37996>
_______________________________________