[Python-ideas] Hooking between lexer and parser

Andrew Barnert abarnert at yahoo.com
Fri Jun 5 11:29:43 CEST 2015


Compiling a module has four steps:

 * bytes->str (based on encoding declaration or default)
 * str->token stream
 * token stream->AST
 * AST->bytecode

You can very easily hook at every point in that process except the token stream.
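
For concreteness, here's roughly what the hookable stages look like with today's public APIs (a minimal sketch; the source and names are just illustrative):

    import ast
    import io
    import tokenize

    source_bytes = b"x = 1\n"

    # bytes -> str: detect_encoding handles the coding declaration
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    text = source_bytes.decode(encoding)

    # str -> token stream
    tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))

    # str -> AST (note: no public entry point takes a token stream)
    tree = ast.parse(text)

    # AST -> bytecode
    code = compile(tree, "<example>", "exec")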

There _is_ a workaround: re-encode the text to bytes, wrap it in a BytesIO, call tokenize, munge the token stream, call untokenize, re-decode back to text, then pass that to compile or ast.parse. But, besides being a bit verbose and painful, that means your line and column numbers get screwed up. So, while it's fine for a quick-and-dirty toy like my user-literal hack, it's not something you'd want to do in a real import hook for use in real code.
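
In code, the workaround looks something like this (a sketch; munge_tokens is whatever token-stream transformation you want to apply):

    import ast
    import io
    import tokenize

    def parse_munged(text, munge_tokens):
        # Re-encode the already-decoded text so tokenize will accept it
        readline = io.BytesIO(text.encode('utf-8')).readline
        tokens = munge_tokens(tokenize.tokenize(readline))
        # untokenize sees the ENCODING token and returns bytes, so re-decode
        new_text = tokenize.untokenize(tokens).decode('utf-8')
        # The AST's line/column numbers now describe the munged source,
        # not the original -- that's the problem described above
        return ast.parse(new_text)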

This could be solved by just changing ast.parse to accept an iterable of tokens or tuples as well as a string, and likewise for compile.
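
With that change, the same hook would collapse to something like this (hypothetical, of course; ast.parse doesn't accept tokens today, and munge_tokens is made up):

    import ast
    import io
    import tokenize

    tokens = tokenize.tokenize(io.BytesIO(b"x = 1\n").readline)
    tree = ast.parse(munge_tokens(tokens))  # proposed: iterable of tokens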

That isn't exactly a trivial change, because under the covers the _ast module is written in C, partly auto-generated, and expects as input a CST, which is itself created from a different tokenizer written in C with a similar but different API (since C doesn't have iterators). And adding a PyTokenizer_FromIterable or something seems like it might raise some fun bootstrapping issues that I haven't thought through yet. But I think it ought to be doable without having to reimplement the whole parser in pure Python, and I think it would be worth doing.

While we're at it, a few other (much smaller) changes would be nice:

 * Allow tokenize to take a text file, instead of requiring a binary file and repeating the encoding detection.
 * Allow tokenize to take a file instead of its readline method.
 * Allow tokenize to take a str/bytes instead of requiring a file.
 * Add flags to compile to stop at any stage (decoded text, tokens, AST, or bytecode) instead of just the last two.
 
(The funny thing is that the C tokenizer actually already does support strings and bytes and file objects.)
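
In the meantime, the str/bytes/file cases are easy enough to approximate at the Python level with a wrapper (a sketch; tokenize_any is a made-up name):

    import io
    import tokenize

    def tokenize_any(source):
        # str or bytes -> wrap in a binary file; file -> use its readline
        if isinstance(source, str):
            source = source.encode('utf-8')
        if isinstance(source, bytes):
            source = io.BytesIO(source)
        return tokenize.tokenize(source.readline)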

I realize that doing all of these changes would mean that compile can now get an iterable and not know whether it's a file or a token stream until it tries to iterate it. So maybe that isn't the best API; maybe it's better to explicitly call tokenize, then ast.parse, then compile instead of calling compile repeatedly with different flags.

