[issue3353] make built-in tokenizer available via Python C API

Wed Jan 27 16:14:20 EST 2021

Pablo Galindo Salgado <pablogsal at gmail.com> added the comment:

Problems that you are going to find:

* The C tokenizer raises syntax errors where the tokenize module does not. For example:

❯ python -c "1_"
  File "<string>", line 1
    1_
     ^
SyntaxError: invalid decimal literal

❯ python -m tokenize <<< "1_"
1,0-1,1:            NUMBER         '1'
1,1-1,2:            NAME           '_'
1,2-1,3:            NEWLINE        '\n'
2,0-2,0:            ENDMARKER      ''

* The encoding cannot be specified directly. You need to thread it through many places.

* The readline() argument can now be any object and return any object; the C tokenizer needs to handle that (better) so it does not crash.

* str vs. bytes handling in the C tokenizer.

* In some cases the C tokenizer does not have the full line, or getting the full line is tricky.
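
A minimal sketch of the first, second and fourth points from the Python side. The helper names (compiler_accepts, pure_tokenize, detect) are hypothetical and only for illustration; the calls themselves (compile, tokenize.generate_tokens, tokenize.detect_encoding) are the documented ones. Whether the tokenize module accepts "1_" may depend on the interpreter version, so the sketch returns the exception instead of assuming either outcome.

```python
import io
import tokenize


def compiler_accepts(source):
    """Return True if compile() accepts the source, False on SyntaxError."""
    try:
        compile(source, "<string>", "exec")
    except SyntaxError:
        return False
    return True


def pure_tokenize(source):
    """Tokenize a str with the tokenize module.

    Returns a list of (token name, token string) pairs, or the exception
    object if this interpreter's tokenize module rejects the input.
    """
    # generate_tokens() expects a readline that returns str lines.
    readline = io.StringIO(source).readline
    try:
        return [(tokenize.tok_name[tok.type], tok.string)
                for tok in tokenize.generate_tokens(readline)]
    except (SyntaxError, tokenize.TokenError) as exc:
        return exc


def detect(source_bytes):
    """Return the source encoding found by tokenize.detect_encoding().

    detect_encoding() expects a readline that returns bytes, which is
    where the str/bytes split shows up for callers.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


if __name__ == "__main__":
    print(compiler_accepts("1_\n"))   # the compiler rejects this literal
    print(pure_tokenize("1_\n"))      # token list or exception, by version
    print(detect(b"x = 1\n"))
```

The point of the comparison is that any caller switching from the tokenize module to the C tokenizer has to decide what to do with input that one accepts and the other rejects, and has to supply bytes or str consistently with the chosen entry point.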

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue3353>
_______________________________________