[Python-ideas] Incorporating something like byteplay into the stdlib

Andrew Barnert abarnert at yahoo.com
Thu Feb 11 22:58:45 EST 2016


tl;dr: We should turn dis.Bytecode into a builtin mutable structure similar to byteplay.Code, to make PEP 511 bytecode transformers implementable.

Why?
----

The first problem is that the API for bytecode transformers is incomplete. You don't get, e.g., varnames, so you can't even write a transformer that looks up or modifies locals. But that part is easy to fix, so I won't dwell on it.
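To make the gap concrete, here's a do-nothing transformer under the draft API (a sketch only: I'm assuming the draft's (bytecode, consts, names, lnotab, context) signature, and the class shape is mine):

```python
# Sketch only: assumes the PEP 511 draft passes the pieces of a code
# object rather than the code object itself.
class NopTransformer:
    name = "nop"  # PEP 511 transformers are named, for .pyc tagging

    def code_transformer(self, bytecode, consts, names, lnotab, context):
        # Note what's missing: varnames never arrives, so a transformer
        # that wants to rewrite local-variable accesses can't be written.
        return bytecode, consts, names, lnotab
```
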


The second problem is that bytecode is just painful to work with. The peephole optimizer deals with this by just punting and returning the original code whenever it sees anything remotely complicated (which I don't think we want to encourage for all bytecode transformers), and it's _still_ pretty hairy code. And it's the kind of code that can easily harbor hard-to-spot and harder-to-debug bugs (line numbers occasionally off on tracebacks, segfaults on about 1/255 programs that do something uncommon, that kind of fun stuff).

The compiler is already doing this work. There's no reason every bytecode processor should have to repeat all of it. I'm not so worried about performance here--technically, fixup is worst-case quadratic, but practically I doubt we spend much time on it--but simplicity. Why should everyone have to repeat dozens of lines of complicated code to write a simple 10-line transformer?
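The dis module already shows what the decoded view buys you (this uses the real, current dis API):

```python
import dis

def add_one(x):
    return x + 1

# dis.get_instructions yields decoded Instruction tuples -- opnames,
# resolved argument values, offsets -- so a processor shouldn't have to
# redo that decoding from raw bytes.
names = [ins.opname for ins in dis.get_instructions(add_one)]
```
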

What?
-----

I played around with a few possibilities from fixed-width bytecode and uncompressed lnotab to a public version of the internal assembler structs, but I think the best one is a flat sequence of instructions, with pseudo-instructions for labels and line numbers, and jump targets just references to those label instructions. 
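Roughly, the shape I have in mind (every name here is hypothetical; this is just to make the structure concrete):

```python
from collections import namedtuple

Instruction = namedtuple('Instruction', ['opcode', 'argval'])
Label = namedtuple('Label', [])                # pseudo-instruction: jump target
SetLineno = namedtuple('SetLineno', ['line'])  # pseudo-instruction: line number

# "if x: y = 1" might come out something like this:
skip = Label()
code = [
    SetLineno(1),
    Instruction('LOAD_FAST', 'x'),
    Instruction('POP_JUMP_IF_FALSE', skip),  # the target is the Label
    SetLineno(2),                            # itself, not a byte offset
    Instruction('LOAD_CONST', 1),
    Instruction('STORE_FAST', 'y'),
    skip,
]
```

Because jumps reference label objects rather than offsets, inserting or deleting instructions never invalidates a target; offsets only come back into existence at assembly time.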

The compiler, instead of generating and fixing up bytecode and then calling PyCode_Optimize, would generate this structure, call PyCode_Optimize, and then generate and fix up the result.

The object should look a lot like dis.Bytecode, which is basically an iterable of Instruction objects with some extra attributes. But not identical to dis.Bytecode.

 * It needs a C API, and probably a C implementation.
 * It should be a mutable sequence, not just an iterable.
 * Instruction objects should be namedtuples.
 * It should be possible to replace an Instruction with a plain (op, arg) tuple (as with tokenize.TokenInfo objects).
 * It should be possible to return any iterable of such tuples instead of a Bytecode (again like tokenize.untokenize).
 * It doesn't need the "low-level" stuff that dis carries around, because that low level doesn't exist yet. For example, a dis.Instruction for a LOAD_CONST has the actual const value in argval, but it also has the co_consts index in arg. We don't need the latter.
 * Instead of separate opcode and opname fields (and dis.opmap and dis.opname to map between them), we should just make the opcodes an enum. (This one could probably be pulled out into a standalone suggestion for the dis module.)
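The enum part is easy to sketch against today's dis module (dis.opmap is real; the Opcode name is mine):

```python
import dis
import enum

# Build an IntEnum from the existing name -> opcode-number map.
Opcode = enum.IntEnum('Opcode', dis.opmap)

# Members compare equal to raw opcode ints, so they work anywhere an
# int opcode is expected, but they repr as names when debugging.
const_op = Opcode.LOAD_CONST
```
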

I think we should just modify dis.Bytecode to handle this use case as well as the existing one. That means it would retain things like Instruction.arg and Bytecode.first_line; they'd just be None if the object wasn't created from a live code object--the same way the existing Bytecode.current_offset is None when it's created from a code or function instead of a frame or traceback.


The reason we want a mutable sequence is that it makes the most sense for processors that do random-access-y stuff. (Also, as Serhiy suggested, if the fixup code does NOP removal, this can be good for some simple processors, too--you can do bytecode[i] = (dis.NOP, None) while in the middle of iterating bytecode.)
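Modeling the proposed mutable sequence as a plain list of (op, arg) pairs, a random-access pass might look like this (a sketch; the real object would be the mutable dis.Bytecode):

```python
import dis

LOAD_CONST = dis.opmap['LOAD_CONST']
POP_TOP = dis.opmap['POP_TOP']
NOP = dis.opmap['NOP']

def drop_dead_consts(bytecode):
    """Replace LOAD_CONST/POP_TOP pairs with NOPs, mutating in place;
    fixup's NOP removal would then delete them at assembly time."""
    for i in range(len(bytecode) - 1):
        if bytecode[i][0] == LOAD_CONST and bytecode[i + 1][0] == POP_TOP:
            bytecode[i] = (NOP, None)
            bytecode[i + 1] = (NOP, None)
    return bytecode
```
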

But for simpler one-pass processors, yielding instructions is really handy:

    def constify_globals(bytecode: dis.Bytecode):
        # Freeze each global to its current value (this sketch assumes
        # the transformer runs with the same globals as the code it's
        # transforming).
        for op, arg in bytecode:
            if op == dis.LOAD_GLOBAL:
                yield (dis.LOAD_CONST, eval(arg, globals()))
            else:
                yield (op, arg)

    def eliminate_double_jumps(bytecode: dis.Bytecode):
        # Retarget any jump whose target is itself an unconditional jump.
        for instr in bytecode:
            if dis.hasjump(instr.opcode):
                target = bytecode[instr.argval.offset + 1]
                if target.opcode in {dis.JUMP_FORWARD, dis.JUMP_ABSOLUTE}:
                    yield (instr.opcode, target.argval)
                    continue
            yield instr

Anyway, I realize this API is still a little vague, but if people don't hate the idea, I'll try to finish up both the design and the proof of concept this weekend. But obviously, any suggestions or holes poked in the idea would be even better before I do that work than after.

Byteplay?
---------

Once I started building a proof of concept, I realized what I was building is almost exactly the functionality of byteplay, but with a more dis-like interface, except without the to_code method.

Of course that method is the hard part--but it's pretty much the same code I already wrote in the compiler; it's just a matter of separating it out and exposing it nicely.

byteplay's to_code does one additional thing the compiler doesn't: it verifies that the stack effects of the assembled code are balanced. This used to be a huge pain, but the 3.4+ version (which leans on the dis module) is a lot simpler. So maybe that's worth adding, too.
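The 3.4+ dis module exposes stack_effect, so a rough balance check is only a few lines. This sketch treats the code as straight-line and ignores jump targets, which a real verifier would handle with a control-flow walk:

```python
import dis

def net_stack_effect(func):
    """Sum per-instruction stack effects; straight-line code that
    pushes its return value and returns should net out to zero."""
    depth = 0
    for instr in dis.get_instructions(func):
        depth += dis.stack_effect(instr.opcode, instr.arg)
    return depth
```
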

One thing byteplay's from_code does that I _don't_ think we want is recursively transforming any code constants into byteplay objects. (The compiler works inside-out; your nested functions are already optimized by the time your transformer is called.)

Anyway, the advantage here would be that import hooks and decorators can use the exact same API as PEP 511 transformers, except they have to call to_code at the end instead of just returning an object and letting the compiler call it for them. (They'll also probably have to recurse manually on (dis.Bytecode(const) for const in co_consts if isinstance(const, types.CodeType)), while PEP 511 transformers won't.)
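That manual recursion is straightforward with today's API (real dis and types calls):

```python
import dis
import types

def walk_code(code):
    """Yield dis.Bytecode for a code object and, recursively, for every
    code object nested in its constants (nested functions,
    comprehensions, class bodies, and so on)."""
    yield dis.Bytecode(code)
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const)
```
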

Alternatives
------------

We could just pass code objects (or all the separate pieces, instead of some of them), and then the docs could suggest using byteplay for non-trivial bytecode transformers, and then everyone will just end up using byteplay.

So, what's wrong with that? The biggest problem is that, after each new Python release, anyone using a bytecode transformer has to wait for byteplay to be updated before they can upgrade. Also, some people will try to do it from scratch rather than add a dependency, get it wrong, and publish processors that work on most code but, in weird edge cases, raise a SystemError or even segfault--a nightmare to debug. And it's annoying that anyone who wants to hack on the bytecode representation pretty much has to disable the peephole optimizer.

Or we could just remove bytecode transformers from PEP 511. PEP 511 still seems worth doing to me, even if it only has AST transformers, especially since all or nearly all of the examples anyone's come up with for it are implementable (and easier to implement) at the AST level.

