[New-bugs-announce] [issue43014] tokenize spends a lot of time in `re.compile(...)`

Anthony Sottile report at bugs.python.org
Sun Jan 24 03:34:14 EST 2021


New submission from Anthony Sottile <asottile at umich.edu>:

I did some profiling of running this script (a few files, including SVGs, attached here):

```python
import io
import tokenize

# picked as the second longest file in cpython
with open('Lib/test/test_socket.py', 'rb') as f:
    bio = io.BytesIO(f.read())


def main():
    for _ in range(10):
        bio.seek(0)
        for _ in tokenize.tokenize(bio.readline):
            pass

if __name__ == '__main__':
    exit(main())
```
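The message doesn't show exactly how the profiles were collected. A minimal sketch using the standard cProfile and pstats modules would produce a stats file like the attached out.pstats; the module name bench_tokenize for the script above is a hypothetical choice, and the real invocation may have differed:

```python
import cProfile
import pstats

# Hypothetical module name for the script above.
from bench_tokenize import main

# Assumed profiling harness; the exact invocation behind the attached
# out.pstats is not given in the report.
profiler = cProfile.Profile()
profiler.runcall(main)
profiler.dump_stats('out.pstats')

# Show the ten most expensive entries by cumulative time.  Per the
# issue title, re.compile should rank high before the fix.
pstats.Stats('out.pstats').sort_stats('cumulative').print_stats(10)
```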


The first profile is from before the optimization; the second is from after it.

The optimization takes the execution from ~6300 ms to ~4500 ms on my machine (representing a 28%-39% improvement, depending on how you calculate it).
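The patch itself isn't shown in this message, but the problem named in the title (repeated `re.compile(...)` calls inside the tokenizer loop) suggests memoizing the compile step so each pattern is compiled at most once. A hypothetical sketch of that technique, not necessarily the actual patch:

```python
import functools
import re


# Hypothetical sketch of the caching idea; the actual change made for
# this issue may differ.  functools.lru_cache memoizes on the pattern
# string, so each distinct pattern is compiled at most once and later
# lookups are a cheap cache hit instead of a call into re.compile.
@functools.lru_cache(maxsize=None)
def _compile(pattern):
    return re.compile(pattern, re.UNICODE)
```

Note that re.compile keeps its own internal cache, but every call still pays for argument handling and the cache lookup inside the re module; memoizing at the call site skips that overhead entirely.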

(I'll attach the pstats and SVGs after creation; it seems I can only attach one file at a time.)

----------
components: Library (Lib)
files: out.pstats
messages: 385572
nosy: Anthony Sottile
priority: normal
severity: normal
status: open
title: tokenize spends a lot of time in `re.compile(...)`
type: performance
versions: Python 3.10, Python 3.9
Added file: https://bugs.python.org/file49759/out.pstats

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue43014>
_______________________________________
