[Python-ideas] Looking for people interested in a Python register virtual machine project

Mon Mar 22 18:13:19 EDT 2021

Thanks for the response. I will try to address your comments inline.

> I guess it should be a good idea to answer what's the scope of this
> project - is it research one or "production" one? If it's research one,
> why be concerned with the churn of over-modern CPython versions?
> Wouldn't it be better to just use some scalable, incremental
> implementation which would allow to forward-port it to a newer version,
> if it ever comes to that?

The motivation for revisiting this idea was/is largely personal. As I
indicated, I first messed around with it over 20 years ago and it's
been in the back of my mind ever since. Somehow I never lost the code
despite I'm not sure how many computers came and went and that the
code was never uploaded to any sort of distributed version control
system. I decided to pick things up again as a way to mostly keep my
head in the game after I retired. So, neither "research" nor
"production" seems to be a correct descriptor. Still, if taken to
functional completion — functional enough for performance testing and
application to more than just toy scripts — I realized pretty quickly
that I'd need help.

> Otherwise, if it's "production", who's the "customer" and how they
> "compensate" you for doing work (chasing the moving target) which is
> clearly of little interest to you and conflicts with the goal of the
> project?

Nobody is compensating me. I have no desire to try and turn it into
something I do for hire. Maybe I misunderstood your question?

> > This PEP proposes the addition of register-based instructions to the
> > existing Python virtual machine, with the intent that they eventually
> > replace the existing stack-based opcodes.
>
> Sorry, what? The purpose of register-based instructions is to just
> replace stack-based instructions? That's not what's I'd like to hear as
> the intro phrase. You probably want to replace one with the other
> because register-based ones offer some benefit, faster execution
> perhaps? That's what I'd like to hear instead of "deciphering" that
> between the lines.

Replacing stack-based instructions would be a reasonable initial goal,
I think. Victor reported performance improvements in his
implementation (also a translator). As I indicated in the "PEP" (I use
that term rather loosely, as I have no plans at the moment to submit
it for consideration, certainly not in its current, incomplete state),
a better ultimate way to go would be to generate register instructions
directly from the AST. The current translation scheme allows me to
write simple test case functions, generate register instructions, then
compare that when called the two produce the same result.

> > They [2 instruction sets] are almost completely distinct.
>
> That doesn't correspond to the mental image I would have. In my list,
> the 2 sets would be exactly the same, except that stack-based encode
> argument locations implicitly, while register-based - explicitly. Would
> be interesting to read (in the following "pep" sections) what makes them
> "almost completely distinct".

Well, sure. The main difference is the way two pairs of instructions
(say, BINARY_ADD vs BINARY_ADD_REG) get their operands and save their
result. You still have to be able to add two objects, call functions,
etc.

> > Within a single function only one set of opcodes or the other will
> > be used at any one time.
>
> That would be the opposite of "scalable, incremental" development
> approach mentioned above. Why not allow 2 sets to freely co-exist, and
> migrate codegeneration/implement code translation gradually?

The fact that I treat the current frame's stack space as registers
makes it pretty much impossible to execute both stack and register
instructions within the same frame. Victor's implementation did things
differently in this regard. I believe he just allocated extra space
for 256 registers at the end of each frame, so (in theory, I suppose),
you could have instructions from both executed in the same frame.

> > ## Motivation
>
> I'm not sure the content of the section corresponds much to its title.
> It jumps from background survey of the different Python VM optimizations
> to (some) implementation details of register VM - leaving "motivation"
> somewhere "between the lines".
>
> > Despite all that effort, opcodes which do nothing more than move data
> > onto or off of the stack (LOAD_FAST, LOAD_GLOBAL, etc) still account
> > for nearly half of all opcodes executed.
>
> ... And - you intend to change that with a register VM? In which way and
> how? As an example, LOAD_GLOBAL isn't going anywhere - it loads a
> variable by *symbolic* name into a register.

Certainly, if you have data which isn't already on the stack, you are
going to have to move data. As the appendix shows though, a fairly
large chunk of the current virtual machine does nothing more than
manipulate the stack (LOAD_FAST, STORE_FAST, POP_TOP, etc).

> > Running Pyperformance using a development version of Python 3.9
> > showed that the five most frequently executed pure stack opcodes
> > (LOAD_FAST, STORE_FAST, POP_TOP, DUP_TOP and ROT_TWO) accounted for
> > 35% of all executed instructions.
>
> And you intend to change that with a register VM? How?

I modified the frame so that (once again) the current local variable
space adjoins the current stack space (which, again, I treat as
registers). The virtual machine can thus access local variables in
place. In retrospect, I suspect it might not have been necessary.
During the current phase where I've yet to implement any *_DEREF_REG
instructions, it would be a moot point though. Still, I'm not sure the
cell/free slots have the same semantics as locals/stack (an area of my
expertise which is lacking). Isn't there an extra level of indirection
there? In any case, if the cell/free slots are semantically just like
locals, then it would be straightforward for me to restore the order
of the data blocks in frames.

> Quick google search leads to
> https://www.strchr.com/x86_machine_code_statistics (yeah, that's not
> VM, it's RM (real machine), stats over different VMs would be
> definitely welcome):
>
> > The most popular instruction is MOV (35% of all instructions).
>
> So, is the plan to replace 35% of "five most frequently executed pure
> stack opcodes" with 35% of register-register move instructions? If not,
> why it would be different and how would you achieve that?

I have clearly not explained myself very well in the "PEP". I will
rework that section. Still though, local variables and stack space are
adjacent in my implementation, so local variables can be addressed
directly without needing to first copy them onto the stack (or into a
register). Clearly, global variables must still be copied into and out
of registers to manipulate. I may well replicate the code object's
constants in the frame as well so they can also be treated as
(read-only) registers. Since FrameObjects are cached for reuse with
the same code object, the cost to copy them should be bearable.

> > They are low-cost instructions (compared with CALL_FUNCTION for
> > example), but still eat up time and space in the virtual machine
>
> But that's the problem of any VM - it's slow by definition. There can
> be less slow and more slow VMs, but VMs can't be fast. So, what's the
> top-level motivation - is it "making CPython fast" or "making CPython a
> little bit less slow"? By how much?

The top-level motivation is to have fun. Need there be anything more
ambitious? Still, point taken. Had I continued to pursue this back in
the early 2000s, or had Victor succeeded in getting his implementation
into the core, we might be much further along. The current problem is
made all the more difficult by the fact that the virtual machine has
grown so much in the past 20+ years.

> > Consider the layout of the data section of a Frame object:
> > All those LOAD_FAST and STORE_FAST instructions just copy pointers
> > between chunks of RAM which are just a few bytes away from each other
> > in memory.
>
> Ok, but LOAD_DEREF and STORE_DEREF instructions also just copy pointers
> (with extra dereferencing, but that's a detail). It's unclear why you
> try to ignore them ("cell" registers), putting ahead "locals" and
> "stack" registers. The actual register instructions implementation would
> just treat any frame slot as a register with continuous numbering,
> allowing to access all of locals, cells, and stack locs in the same
> way. In that regard, trying to rearrange 3 groups at this stage seems
> like rather unneeded implementation complexity with no clear motivation.

I haven't even looked at LOAD_DEREF or STORE_DEREF yet. I think that
extra dereferencing will be more than a simple detail though. That
makes the semantics of cell/free slots different than locals/registers
slots (more like globals). If true, then my reordering of the frame
data is worthwhile, I think.

Skip