[Python-ideas] Looking for people interested in a Python register virtual machine project

Skip Montanaro skip.montanaro at gmail.com
Tue Mar 23 15:39:55 EDT 2021


> a) You're working with CPython's bleeding edge.
> b) You find that the bleeding edge adds extra chores.
> c) Nobody told you to work on the bleeding edge (no boss, nor a
> maintainer who said "I'll merge it once you're done"),
>
> Then: why complicate your task by working on the bleeding edge?
> You could take a not-too-old CPython version, e.g. 3.8/3.9, instead,
> and work with that.

I started this in the 3.9 alpha timeframe. Staying up to date wasn't
too difficult. It never occurred to me that the virtual machine would
undergo so much churn for 3.10, so I just stuck with main/master and
paid the price when the changes started to arrive in earnest. When the
3.10 branch is created, I will take that off-ramp this time.

> Btw, from just the "pep", it's unclear whether, and how much, you reuse
> Victor's work. If not, why not? (The answer is useful to contributors -
> you ask them to "reuse" your code - how does that square with your
> reuse of the code of the folks who came before you?)

Both Victor's and my earlier work took place in the dim dark past.
There have been so many functional changes to the virtual machine that
directly reusing either old code base wasn't feasible. I do have a
copy of Victor's work, though, which I have referred to from time to
time. I just never tried to merge it with something recent, like 3.9.

> > > explicitly. Would be interesting to read (in the following "pep"
> > > sections) what makes them "almost completely distinct".
> >
> > Well, sure. The main difference is the way the two instructions in a
> > pair (say, BINARY_ADD vs BINARY_ADD_REG) get their operands and save
> > their result. You still have to be able to add two objects, call
> > functions, etc.
>
> I'd note that your reply skips answering the question about the calling
> convention for a register-based VM, and that's again one of the most
> important questions (and one whose answer I'd be genuinely interested
> to hear).

I've not attempted to make any changes to calling conventions. It
occurred to me that the LOAD_METHOD/CALL_METHOD pair could perhaps be
merged into a single opcode, but I haven't really thought about that.
Perhaps there's a good reason the method is looked up before the
arguments are pushed onto the stack (call_function()?). In a
register-based VM there's no need to do things in that order.

> > The fact that I treat the current frame's stack space as registers
> > makes it pretty much impossible to execute both stack and register
> > instructions within the same frame.
>
> I don't see how that would be true (in general, I understand that you
> may have constraints re: that, but that's exactly why I bring that up -
> why do you have constraints like that?). Even the existing Python VM
> allows using both in the same frame, e.g. LOAD_FAST: it takes the value
> of a register and puts it on the stack.

Sure, but that's because it must. All operands must be on the stack.
My code does have a step where it tries to remove LOAD_FAST_REG and
STORE_FAST_REG opcodes. It's not very good yet, though; pesky implicit
references cause problems. Still, I am able to remove some of them.
This should get better over time. And it is possible that at some
point I will decide to add back some stack space for things like
calling functions, constructing lists, etc.
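
Just to sketch what that removal step amounts to (a toy model of the
idea only -- no control flow, and nothing like the actual code), it is
basically forward copy propagation over the register instructions, and
it has to give up whenever a later instruction refers to a register
range implicitly:

    /* Toy IR: each instruction reads up to two registers and writes one
       (-1 == unused).  LOAD_FAST_REG is just a register-to-register move
       here; implicit_range marks instructions such as CALL_FUNCTION_REG
       or BUILD_LIST_REG that consume a run of registers by convention. */
    enum { NOP_REG, LOAD_FAST_REG, BINARY_ADD_REG, CALL_FUNCTION_REG };

    typedef struct {
        int op;
        int dst, src1, src2;
        int implicit_range;
    } Inst;

    static void
    elide_moves(Inst *code, int n)
    {
        for (int i = 0; i < n; i++) {
            if (code[i].op != LOAD_FAST_REG)
                continue;
            int dst = code[i].dst, src = code[i].src1;
            /* Find the window in which dst holds the moved value and
               check that nothing in it blocks renaming dst back to src. */
            int end = n, ok = 1;
            for (int j = i + 1; j < n; j++) {
                if (code[j].implicit_range || code[j].dst == src) {
                    ok = 0;      /* implicit reference, or src clobbered */
                    break;
                }
                if (code[j].dst == dst) {
                    end = j + 1; /* dst redefined; the window ends here */
                    break;
                }
            }
            if (!ok)
                continue;        /* keep the LOAD_FAST_REG */
            /* Safe: rename the uses and drop the move. */
            for (int j = i + 1; j < end; j++) {
                if (code[j].src1 == dst) code[j].src1 = src;
                if (code[j].src2 == dst) code[j].src2 = src;
            }
            code[i].op = NOP_REG;
        }
    }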

> Do you mean details like the need to translate stack-based instructions
> into 2 (or more) instructions: a) the actual register-register
> instruction and b) a stack pointer adjustment, so the stack-based
> instructions keep working?

Yes, but I see (I think) what you're getting at. If I continued to
maintain the stack pointer, in theory stack opcodes could exist along
with register opcodes.

> Yes, you would need to do that, until you fully switch to
> register-based ones. But then there are 2 separate tasks:
>
> 1. Make the register VM work. (Should be medium complexity.)
> 2. Make it fast. (Likely will be hard.)

I'm not too worried about #2 yet. :-) And as demonstrated by the
current project's incompleteness, either I'm not up to medium
complexity tasks anymore or it's harder than you think. :-)

Some of the second step isn't too hard, I don't think. I already
mentioned eliding generated fast LOAD/STORE instructions, and in my
previous email mentioned copying constants from the code object to the
frame object on creation. I also think opcode prediction and fast
dispatch should be straightforward. I just haven't bothered with that
yet.
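
For the constants piece, the idea is nothing fancier than this kind of
thing at frame (re)initialization (a sketch only -- where the constants
land in f_localsplus, and the const_base bookkeeping, are made up here):

    #include <Python.h>
    #include <frameobject.h>

    /* Copy the code object's constants into the frame's register file so
       instructions can address them like any other (read-only) register. */
    static void
    copy_consts_to_frame(PyFrameObject *f, PyCodeObject *co,
                         Py_ssize_t const_base)
    {
        Py_ssize_t n = PyTuple_GET_SIZE(co->co_consts);
        for (Py_ssize_t i = 0; i < n; i++) {
            PyObject *c = PyTuple_GET_ITEM(co->co_consts, i);
            Py_INCREF(c);
            PyObject *old = f->f_localsplus[const_base + i];
            f->f_localsplus[const_base + i] = c;
            Py_XDECREF(old);
        }
    }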

> If you want to achieve both right from the start - oh-oh, that may be
> double-hard.
>
> > Victor's implementation did things
> > differently in this regard. I believe he just allocated extra space
> > for 256 registers at the end of each frame, so (in theory, I suppose),
> > you could have instructions from both executed in the same frame.
>
> I hope you have a plan for how to deal with more than 256 registers,
> etc. A register VM adds a lot of accidental implementation complexity ;-).

One of the reasons I just reused the current stack space as my
register space harks back to a thread with Tim Peters in the earliest
days of my original work. He indicated that it was possible to use no
more space than the allocated stack space. That worked for me then. (I
spent a few hours one day looking for that thread, but never found
it.)

> > > > ## Motivation
> > >
> > > I'm not sure the content of the section corresponds much to its
> > > title. It jumps from a background survey of the different Python VM
> > > optimizations to (some) implementation details of the register VM -
> > > leaving "motivation" somewhere "between the lines".
> > >
> > > > Despite all that effort, opcodes which do nothing more than move
> > > > data onto or off of the stack (LOAD_FAST, LOAD_GLOBAL, etc) still
> > > > account for nearly half of all opcodes executed.
> > >
> > > ... And - you intend to change that with a register VM? In which
> > > way and how? As an example, LOAD_GLOBAL isn't going anywhere - it
> > > loads a variable by *symbolic* name into a register.
> >
> > Certainly, if you have data which isn't already on the stack, you are
> > going to have to move data.
>
> Even if you have data in registers, you still may need to move it
> around to accommodate special conventions of some instructions.

Yes. In particular, function calling and the construction of lists
(and similar collections) require the arguments to be in order, in
contiguous locations.
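
For example, a hypothetical BUILD_LIST_REG pretty much has to be "base
register plus count", which is what forces the elements into consecutive
slots. In ceval.c-style pseudo-code (the opcode name, oparg encoding and
the regs pointer are all invented for this sketch, not necessarily what
my branch does):

    case TARGET(BUILD_LIST_REG): {
        Py_ssize_t base = REG_BASE(oparg);    /* first element register */
        Py_ssize_t count = REG_COUNT(oparg);  /* number of elements */
        PyObject *list = PyList_New(count);
        if (list == NULL)
            goto error;
        for (Py_ssize_t i = 0; i < count; i++) {
            /* The elements must already sit in regs[base] ..
               regs[base + count - 1], in order. */
            PyObject *item = regs[base + i];
            Py_INCREF(item);
            PyList_SET_ITEM(list, i, item);
        }
        PyObject *old = regs[REG_DST(oparg)];
        regs[REG_DST(oparg)] = list;
        Py_XDECREF(old);
        DISPATCH();
    }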

> > As the appendix shows though, a fairly
> > large chunk of the current virtual machine does nothing more than
> > manipulate the stack (LOAD_FAST, STORE_FAST, POP_TOP, etc).
> >
> > > > Running Pyperformance using a development version of Python 3.9
> > > > showed that the five most frequently executed pure stack opcodes
> > > > (LOAD_FAST, STORE_FAST, POP_TOP, DUP_TOP and ROT_TWO) accounted
> > > > for 35% of all executed instructions.
> > >
> > > And you intend to change that with a register VM? How?
> >
> > I modified the frame so that (once again) the current local variable
> > space adjoins the current stack space (which, again, I treat as
> > registers). The virtual machine can thus access local variables in
> > place.
>
> For simple instructions, yes. What about instructions with multiple
> (an arbitrary number of) arguments, like CALL_FUNCTION*/CALL_METHOD*,
> BUILD_*, etc.? (That's the same question as asked in the first mail
> and not answered here.)

CALL_FUNCTION and one or two variants, as well as BUILD_*, are done
and work. I have to be careful when removing unneeded LOAD/STORE
instructions to pay attention to these implicit references (e.g.,
"%r1 and the next three slots" instead of "%r1, %r2, %r3, %r4").

> > In retrospect, I suspect it might not have been necessary.
> > During the current phase where I've yet to implement any *_DEREF_REG
> > instructions, it would be a moot point though. Still, I'm not sure the
> > cell/free slots have the same semantics as locals/stack (an area of my
> > expertise which is lacking). Isn't there an extra level of indirection
> > there? In any case, if the cell/free slots are semantically just like
> > locals, then it would be straightforward for me to restore the order
> > of the data blocks in frames.
>
> Yes, that's the point - semantically they're just locals, even though
> they are usually accessed with an extra level of indirection.
>
> > > Quick google search leads to
> > > https://www.strchr.com/x86_machine_code_statistics (yeah, that's not
> > > VM, it's RM (real machine), stats over different VMs would be
> > > definitely welcome):
> > >
> > > > The most popular instruction is MOV (35% of all instructions).
> > >
> > > So, is the plan to replace the 35% from the "five most frequently
> > > executed pure stack opcodes" with 35% register-register move
> > > instructions? If not, why would it be different and how would you
> > > achieve that?
> >
> > I have clearly not explained myself very well in the "PEP".
>
> Well, it seems to be written with the idea that the reader is already
> familiar with the benefits of register-based VMs. As a fresh reader, I
> tried to point out that fact. I also happen to be familiar with those
> benefits, and with the fact that "on average" register-based VMs are
> faster than stack-based ones. But that only makes me ask why you think
> those "on average" benefits would apply to the Python VM's case, and
> ask for additional details regarding its more complex cases (which
> would IMHO noticeably cancel any RVM benefits, unless you have a
> cunning, well-grounded plan to deal with them).

Yeah, that's a definite shortcoming. I will try to add some more
introductory material. Also, as I was implementing and stumbling on
things like implicit register references, I was not going back and
adding content to the document.

I'm going to stop here and see if I can improve the document while
some things are fresh in my mind. I will leave the rest for another
day. You've given me plenty to think about.

Thanks,

Skip

> > I will
> > rework that section. Still though, local variables and stack space are
> > adjacent in my implementation, so local variables can be addressed
> > directly without needing to first copy them onto the stack (or into a
> > register).
>
> Only for simple operations. Where the idyllic picture for RISC CPUs
> with their uniform register files breaks down is function calling
> (Python analog: CALL_FUNCTION). For not-purely-RISC CPUs, an additional
> drawback is instructions with ad hoc register constraints (the Python
> analog would be BUILD_LIST, etc.)
>
> > Clearly, global variables must still be copied into and out
> > of registers to manipulate. I may well replicate the code object's
> > constants in the frame as well so they can also be treated as
> > (read-only) registers. Since FrameObjects are cached for reuse with
> > the same code object, the cost to copy them should be bearable.
> >
> > > > They are low-cost instructions (compared with CALL_FUNCTION for
> > > > example), but still eat up time and space in the virtual machine
> > >
> > > But that's the problem of any VM - it's slow by definition. There
> > > can be less slow and more slow VMs, but VMs can't be fast. So,
> > > what's the top-level motivation - is it "making CPython fast" or
> > > "making CPython a little bit less slow"? By how much?
> >
> > The top-level motivation is to have fun. Need there be anything more
> > ambitious?
>
> Well, my point is that if you ask for contributors, it's fair to be
> fair with them ;-). If you ask them to join the fun, it's fair, and
> then it makes sense to maximize the level of fun and minimize the
> level of chore (no bleeding-edge chasing, minimize changes overall, use
> incremental development, where you always have something working, not
> "we'll get that working in a year if we pull well every day").
>
> Otherwise, maybe it would be useful to have a more objective
> criterion/goal, like "let's try to make Python faster", and then it
> makes sense to explain why you think register VM would be faster in
> general, and in Python case specifically. (The work process can still
> be fun-optimized, chore-minimizing ;-) ).
>
> > Still, point taken. Had I continued to pursue this back in
> > the early 2000s, or had Victor succeeded in getting his implementation
> > into the core, we might be much further along. The current problem is
> > made all the more difficult by the fact that the virtual machine has
> > grown so much in the past 20+ years.
> >
> > > > Consider the layout of the data section of a Frame object:
> > > > All those LOAD_FAST and STORE_FAST instructions just copy pointers
> > > > between chunks of RAM which are just a few bytes away from each
> > > > other in memory.
> > >
> > > Ok, but LOAD_DEREF and STORE_DEREF instructions also just copy
> > > pointers (with extra dereferencing, but that's a detail). It's
> > > unclear why you try to ignore them ("cell" registers), putting
> > > "locals" and "stack" registers ahead of them. The actual register
> > > instruction implementation would just treat any frame slot as a
> > > register with continuous numbering, allowing access to all of the
> > > locals, cells, and stack slots in the same way. In that regard,
> > > trying to rearrange the 3 groups at this stage seems like rather
> > > unneeded implementation complexity with no clear motivation.
> >
> > I haven't even looked at LOAD_DEREF or STORE_DEREF yet. I think that
> > extra dereferencing will be more than a simple detail though. That
> > makes the semantics of cell/free slots different from locals/register
> > slots (more like globals). If true, then my reordering of the frame
> > data is worthwhile, I think.
>
> Until you have benchmarking data that proves that on a *specific CPU*
> one layout is 0.39% faster than the other (it will be different, and
> maybe even the opposite, on another arch/CPU), it's all guesswork, I'm
> afraid. And extra changed code to maintain, which spoils the fun ;-).

