[Python-ideas] Hooking between lexer and parser

Andrew Barnert abarnert at yahoo.com
Sun Jun 7 15:05:26 CEST 2015


On Jun 6, 2015, at 22:59, Nick Coghlan <ncoghlan at gmail.com> wrote:
> 
> On 7 June 2015 at 08:52, Andrew Barnert via Python-ideas
> <python-ideas at python.org> wrote:
>> Also, if we got my change, I could write code that cleanly hooks parsing in
>> 3.6+, but uses the tokenize/untokenize hack for 2.7 and 3.5, so people can
>> at least use it, and all of the relevant and complicated code would be
>> shared between the two versions. With your change, I'd have to write code
>> that was completely different for 3.6+ than what I could backport, meaning
>> I'd have to write, debug, and maintain two completely different
>> implementations. And again, for no benefit.
> 
> I don't think I've said this explicitly yet, but I'm +1 on the idea of
> making it easier to "hack the token stream". As Andrew has noted, there
> are two reasons this is an interesting level to work at for certain
> kinds of modifications:
> 
> 1. The standard Python tokeniser has already taken care of converting
> the byte stream into Unicode code points, and the code point stream
> into tokens (including replacing leading whitespace with the
> structural INDENT/DEDENT tokens)

Actually, as I discovered while trying to hack in the change this afternoon, the C tokenizer doesn't take care of converting the byte stream to Unicode. It does take care of detecting the encoding, but what it hands to the parsetok function is still encoded bytes.

The Python wrapper does transparently decode for you (in 3.x), but that actually just makes it harder to feed the output back into the parser, because the parser wants encoded bytes. (Also, as I mentioned before, it would be nice if the Python wrapper could just take Unicode in the first place, because the most obvious place to use this is in an import hook, where you can detect and decode the bytes yourself in a single line, and it's easier to just use the string than to encode it to UTF-8 so the tokenizer can detect UTF-8, so that either the Python tokenizer wrapper or the C parser can decode it again...)
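
To make that concrete, here's a minimal sketch of the two paths through the tokenize module as I understand them (the helper names are just for illustration): generate_tokens takes a str-returning readline, so an import hook that has already decoded the bytes can skip the re-encode/re-detect round trip that tokenize.tokenize forces on you.

    import io
    import tokenize

    def tokens_from_text(source_text):
        # Already-decoded source: generate_tokens() takes a str readline,
        # so no encode/detect/decode round trip is needed.
        return tokenize.generate_tokens(io.StringIO(source_text).readline)

    def tokens_from_bytes(source_bytes):
        # The documented path: tokenize() wants encoded bytes, detects the
        # encoding itself, and emits an ENCODING token first.
        return tokenize.tokenize(io.BytesIO(source_bytes).readline)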

Anyway, this part was at least easy to temporarily work around; the stumbling block that prevented me from finishing a working implementation this afternoon is a bit hairier. The C tokenizer hands the parser the current line (which can actually be multiple lines) and start and end pointers to characters within that line. It also hands it the current token string, but the parser ignores that and just reads from line+start to line+end. The Python tokenizer, on the other hand, gives you line number and (Unicode-based) column numbers for start and end. Converting those to encoded-bytes offsets isn't _that_ hard... but those are offsets into the original (encoded) line, so the parser is going to see the value of the original token rather than the token value(s) you're trying to substitute, which defeats the entire purpose.
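
As a rough illustration of the mismatch (the column arithmetic here is mine, not anything the parser does today): the Python tokenizer's columns are code-point offsets into the decoded line, and turning them into byte offsets into the encoded line is mechanical enough--the problem is that they still point into the original line, not at anything you substituted.

    import io
    import tokenize

    source = "x = 'héllo'\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        tok_type, string, (srow, scol), (erow, ecol), line = tok
        # scol/ecol count code points in `line`; the C parser wants byte
        # offsets into the encoded line, which is easy enough to compute...
        byte_start = len(line[:scol].encode('utf-8'))
        byte_end = len(line[:ecol].encode('utf-8'))
        # ...but they still index the original line, not a substituted token.
        print(tokenize.tok_name[tok_type], repr(string),
              (srow, scol), (erow, ecol), (byte_start, byte_end))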

I was able to implement a hacky workaround using untokenize to fake the current line and provide offsets within that, but that means you get garbage source lines in SyntaxErrors, and all your column numbers--and, worse, all your line numbers, if you add in a multi-line token--are off within the AST and bytecode. (And there may be other problems; those are just the ones I saw immediately when I tried it...)
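
For comparison, the plain tokenize/untokenize hack I mentioned for 2.7/3.5 is roughly this shape (`transform` is just a placeholder for whatever token-stream rewriter you have); it suffers from the same kind of position drift, since positions in the compiled code come from the regenerated text rather than the original whenever token lengths change.

    import io
    import tokenize

    def hacked_compile(source_text, transform):
        # Tokenize, let `transform` rewrite the token stream, untokenize
        # back into source text, and compile that.
        toks = tokenize.generate_tokens(io.StringIO(source_text).readline)
        new_source = tokenize.untokenize(transform(toks))
        return compile(new_source, '<hooked>', 'exec')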

I think what I'm going to try next is to fork the whole parsetok function and write a version that uses the token's string instead of the substring of the line, and start and stop as offsets instead of pointers. I'm still not sure whether the token string and line should be in tok->encoding, UTF-8, UTF-32, or a PyUnicode object, but I'll figure that out as I go... Once I get that working for the wrapped-up token iterator, I can see whether I can reunify it with the existing version for the C tokenizer (without any performance penalty, and without breaking pgen). I'd hate to have two copies of that giant function to keep in sync.

Meanwhile, I'm not sure what to do about tokens that don't have the optional start/stop/line values. Maybe just not allow them (just because untokenize can handle it doesn't mean ast.parse has to), or maybe just untokenize a fake line (and if any SyntaxErrors are ugly and undebuggable, well, don't skip those values...). The latter might be useful if for some reason you wanted to generate tokens on the fly instead of just munging a stream of tokens from source text you have available.
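
Just to be clear about what untokenize already tolerates (this is a demonstration, not an argument for the API): bare (type, string) pairs work, and untokenize invents the spacing itself. The question is whether a parse-from-tokens entry point should be equally forgiving.

    import tokenize
    from tokenize import NAME, OP, NUMBER, NEWLINE, ENDMARKER

    # Bare (type, string) pairs, with no start/end/line information at all.
    pairs = [(NAME, 'x'), (OP, '='), (NUMBER, '42'),
             (NEWLINE, '\n'), (ENDMARKER, '')]
    # untokenize() regenerates the spacing itself, so the output text is
    # only approximately what the original source would have looked like.
    print(tokenize.untokenize(pairs))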

I'm also not sure what to do about a few error cases. For example, if you feed the parser something that isn't iterable, or whose values aren't iterables of length 2 to 5 with the right types, that really feels more like a TypeError than a SyntaxError (and that would also be a good way to signal the end user that the bug is in the token stream transformer rather than in the source code...). But raising a TypeError from within the parser requires a bit more refactoring--the tokenizer can't tell the parser what error to raise, just that the current token is an error, along with a tokenizer error code (although I could add an E_NOTOK error code that the parser interprets as "raise a TypeError instead of a SyntaxError")--and I'm not sure whether that would affect any other stuff. Anyway, for the first pass I may just leave it as a SyntaxError, just to get something working.
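
Something like this (a purely hypothetical helper, name and all) is the kind of up-front check I have in mind, so the user gets a TypeError pointing at the token-stream transformer instead of a confusing SyntaxError about their source:

    def check_token_stream(tokens):
        # Hypothetical validation, not an existing API: reject a malformed
        # token stream with a TypeError before it reaches the parser, so the
        # user knows the bug is in the token-stream transformer rather than
        # in their source code.
        try:
            tokens = list(tokens)
        except TypeError:
            raise TypeError('token stream is not iterable') from None
        for tok in tokens:
            try:
                tup = tuple(tok)
            except TypeError:
                raise TypeError('token %r is not a sequence' % (tok,)) from None
            if not (2 <= len(tup) <= 5):
                raise TypeError('token %r is not of length 2 to 5' % (tok,))
            if not isinstance(tup[0], int) or not isinstance(tup[1], str):
                raise TypeError('token %r has the wrong field types' % (tok,))
        return tokens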

Finally, it might be nice if it were possible to generate a SyntaxError that showed the original source line but also told you that the tokens don't match the source (again, to signal the end user that he should look at what the hook did to his code, not just his code), but I'm not sure how necessary that is, or how easy it will be (it depends on how I end up refactoring parsetok).

> If all you're wanting to do is token rewriting, or to push the token
> stream over a network connection in preference to pushing raw source
> code or fully compiled bytecode

I didn't think about that use case at all, but that could be very handy.


