[Python-Dev] [ANN] VPython 0.1

Thu Oct 23 09:08:33 CEST 2008

Hey,

I hope you don't mind my replying in digest form.

First off, I guess I should be a little clearer as to what VPython is
and what it does.

VPython is essentially a set of patches for CPython (it touches only
three files, diff -b is about 800 lines IIRC plus the switch statement
in ceval.c's EvalFrameEx()).

The main change is moving the VM instruction implementations (in
CPython, the blocks of code following each case label) into a separate
file, adding Vmgen stack comments, and removing the explicit stack
manipulation code (plus some minor modifications, like renaming
variables to work with Vmgen's type prefixes and labels to enable the
generation of superinstructions).
Vmgen parses the stack comments and prints some macros around the
provided instruction code (incidentally, this reduced the 1500 line
switch body to about 1000 lines).
Interested parties should consult ceval.vmg and ceval-vm.i.
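
To give a rough feel for what "parsing the stack comments and printing
macros around the code" means, here is a toy sketch in Python. It is not
Vmgen and does not reproduce Vmgen's grammar or output: the function
name expand(), the POP()/PUSH()/NEXT() macro names and the output layout
are all made up for illustration, and reference counting is ignored.

```python
def expand(name, stack_comment, body):
    """Wrap a C instruction body in the stack code implied by its comment.

    Toy illustration only: the "( in... -- out... )" form is ordinary
    stack-effect notation, but POP()/PUSH()/NEXT() and the layout below
    are hypothetical and are not what Vmgen really emits.
    """
    inner = stack_comment.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError("stack comment must look like ( a b -- c )")
    before, sep, after = inner[1:-1].partition("--")
    if not sep:
        raise ValueError("stack comment needs a '--' separator")
    inputs, outputs = before.split(), after.split()

    # Declare every name once, keeping first-seen order.
    names = list(dict.fromkeys(inputs + outputs))
    lines = [f"{name}: {{"]
    if names:
        lines.append("    PyObject *" + ", *".join(names) + ";")
    # The rightmost input is the top of the stack, so pop in reverse.
    for var in reversed(inputs):
        lines.append(f"    {var} = POP();")
    lines.extend("    " + line for line in body.strip().splitlines())
    for var in outputs:
        lines.append(f"    PUSH({var});")
    lines.append("    NEXT();")
    lines.append("}")
    return "\n".join(lines)


print(expand("BINARY_ADD", "( a1 a2 -- a )", "a = PyNumber_Add(a1, a2);"))
```

The point is only that the instruction author writes the one-line body
and the stack comment, and the pops and pushes are generated.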

The nice thing about this is that:

a) It's fairly easy to implement different types of dispatch, simply by
changing a few macros (and while I haven't done this, it shouldn't be a
problem to add some switch dispatch #ifdefs for non-GCC platforms).

In particular, direct threaded code leads to less horrible branch
prediction than switch dispatch on many machines (exactly how
pronounced this effect is depends heavily on the specific
architecture).

b) Vmgen can generate superinstructions.
A quick primer:
A sequence of code such as LOAD_CONST LOAD_FAST BINARY_ADD will, in
CPython, push some constant onto the stack, push some local onto the
stack, then pop both off the stack, add them and push the result back
onto the stack.
Turning this into a superinstruction means inlining LOAD_CONST and
LOAD_FAST, modifying them to store the values they'd otherwise push
onto the stack in local variables and adding a version of BINARY_ADD
which reads its arguments from those local variables rather than the
stack (this reduces dispatch time in addition to pops and pushes).
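
A minimal sketch of the effect, as a toy stack VM in Python (nothing
here is taken from ceval.vmg; the tuple-based instruction format, the
run() function and the fused opcode name are assumptions made for the
example). It counts dispatches, pushes and pops for the plain sequence
and for the fused superinstruction:

```python
def run(code, consts, fast_locals):
    """Run a toy stack program; return (result, counts)."""
    stack = []
    counts = {"dispatches": 0, "pushes": 0, "pops": 0}

    def push(value):
        counts["pushes"] += 1
        stack.append(value)

    def pop():
        counts["pops"] += 1
        return stack.pop()

    for op, *args in code:
        counts["dispatches"] += 1
        if op == "LOAD_CONST":
            push(consts[args[0]])
        elif op == "LOAD_FAST":
            push(fast_locals[args[0]])
        elif op == "BINARY_ADD":
            right = pop()
            left = pop()
            push(left + right)
        elif op == "LOAD_CONST_LOAD_FAST_BINARY_ADD":
            # Hypothetical superinstruction: both operands stay in local
            # variables and never touch the stack; only the result is
            # pushed.
            left = consts[args[0]]
            right = fast_locals[args[1]]
            push(left + right)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    # Fetch the result directly so it is not counted as a VM pop.
    return stack.pop(), counts


plain = [("LOAD_CONST", 0), ("LOAD_FAST", 0), ("BINARY_ADD",)]
fused = [("LOAD_CONST_LOAD_FAST_BINARY_ADD", 0, 0)]

print(run(plain, [10], [32]))
print(run(fused, [10], [32]))
```

Both programs compute the same value; the fused one goes through the
dispatch loop once instead of three times and does one push and no pops
instead of three pushes and two pops.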

David Gregg (and friends) recently published a paper comparing stack
based and register based VMs for Java and found that register based VMs
were substantially faster. The main reason for this appears to be the
absence of the various LOAD_ instructions in a register VM. They looked
at mitigating this using superinstructions, but Java would have
required (again, IIRC) about 1000 of them (which leads to substantial
code growth).

Since Python doesn't have multiple (typed) versions of every
instruction (Java has iadd and so on), far fewer superinstructions are
necessary.

On my system, superinstructions account for about 10% of the 30%
performance gain.

As for limitations, as the README points out, currently 2 cases in
test_doctest fail, as well as 1 case in test_hotshot, test_inspect, and
test_subprocess, and most of the cases in test_trace.
The reason for this is, I suspect, that I removed the line tracing code
from ceval.c (I didn't want to look at it in detail, and it doesn't seem
to affect anything else). I expect this would be a bit of work to fix
but I don't see it as a huge problem (in fact, if you don't use
settrace(?) it shouldn't affect you?).

Stack caching: a previous version of VPython supported this, but the
performance gain was minimal (maybe 1-2%, though if done really well
(e.g. using x as the top of stack cache), who knows, more may be
possible). Also, it led to some problems with the garbage collector
seeing an out-of-date stack_pointer[-1].
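
To show why a cached top of stack can leave the in-memory slot out of
date, here is a toy sketch in Python. It is not the code from the
earlier VPython version: the function run_with_tos_cache(), the
fixed-size list standing in for the value stack and the observe()
callback (standing in for anything that inspects the stack in memory,
such as a collector) are all assumptions for the example.

```python
def run_with_tos_cache(code, consts, observe):
    """Toy stack VM that keeps the top of stack in the local `tos`."""
    mem = [None] * 8   # in-memory value stack
    sp = 0             # number of live items, including the cached one
    tos = None         # cached copy of the top item (valid when sp > 0)

    for op, *args in code:
        if op == "LOAD_CONST":
            if sp > 0:
                mem[sp - 1] = tos      # spill the old top to memory
            tos = consts[args[0]]      # the new top lives only in the cache
            sp += 1
        elif op == "BINARY_ADD":
            if sp < 2:
                raise ValueError("stack underflow")
            left = mem[sp - 2]         # the second item is always in memory
            tos = left + tos           # the result stays in the cache
            sp -= 1
        else:
            raise ValueError(f"unknown opcode {op!r}")
        # Report what an outside observer sees in the live memory slots.
        observe(mem[:sp])
    return tos


seen = []
program = [("LOAD_CONST", 0), ("LOAD_CONST", 1), ("BINARY_ADD",)]
result = run_with_tos_cache(program, [1, 2], seen.append)
print(result, seen)
```

After the add, the real top of stack is 3, but the top slot in memory
still holds the old operand 1, which is the kind of stale value an
observer reading stack_pointer[-1] would pick up.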

``Cell'' is, unfortunately, hardcoded into Vmgen. I could either patch
that or run ceval-vm.i through sed or something.

Finally, to the people who pointed out that VPython (the name) is
already taken: Darn! I really should have checked that! Will call it
something else in the future.

Anyway, HTH,
-jakob