getting rid of data movement instructions

Skip Montanaro skip at pobox.com
Sun Aug 19 00:21:07 EDT 2001


    >> If you want some performance boost, think about ways you might delete
    >> about 90% of those instructions. ;-)

    Alex> Hmmm, I'll byte -- by having joined LOAD_this_STORE_that bytecodes
    Alex> (5 kinds of loads, 3 kinds of stores, that's 15 new bytecodes) and
    Alex> a few call-and-poptop ones?  And a peephole optimizer to change
    Alex> the bytecode stream coming from the current compiler to one using
    Alex> the new bytecodes (and making a few more minor opts while he's at
    Alex> it)...?

I'm not sure I can parse that paragraph.

    Alex> Or am I missing something obvious...?  'cause I don't think that
    Alex> would delete 90% of the datamovement instructions -- maybe 30 to
    Alex> 40% or so...?

Well, perhaps 90% is too optimistic, but I think the target should
eventually be closer to 90% than 30%, though it won't be anywhere near either
number real soon at the rate I'm going.  Let me expose my quite possibly
flawed thinking on the subject.

While I'm writing this I'm staring at a slightly modified version of a
dynamic opcode frequency table Marc-Andre Lemburg posted to python-dev
a while back, so I figure others who've read this far might be interested in
it.  I zapped SET_LINENO frequencies because python -O already takes care
of them and attached percentages to his raw counts.


        LOAD_FAST: 19323126 (20.08%)      BINARY_SUBSCR:  1834612 ( 1.91%)
       LOAD_CONST:  9254683 ( 9.62%)      JUMP_ABSOLUTE:  1648327 ( 1.71%)
        LOAD_NAME:  8218954 ( 8.54%)       STORE_SUBSCR:  1446307 ( 1.50%)
      LOAD_GLOBAL:  7174876 ( 7.45%)       JUMP_FORWARD:  1014821 ( 1.05%)
       STORE_FAST:  5927769 ( 6.16%)    BINARY_SUBTRACT:   910085 ( 0.95%)
          POP_TOP:  5587424 ( 5.81%)          POP_BLOCK:   806160 ( 0.84%)
    CALL_FUNCTION:  5404709 ( 5.62%)       STORE_GLOBAL:   779880 ( 0.81%)
    JUMP_IF_FALSE:  5289262 ( 5.50%)           FOR_LOOP:   735245 ( 0.76%)
       COMPARE_OP:  4495179 ( 4.67%)         SETUP_LOOP:   657432 ( 0.68%)
        LOAD_ATTR:  3481878 ( 3.62%)      BINARY_MODULO:   610121 ( 0.63%)
       BINARY_ADD:  3420811 ( 3.55%)            SLICE+2:   530811 ( 0.55%)
     RETURN_VALUE:  2221212 ( 2.31%)            SLICE+1:   530657 ( 0.55%)
       STORE_NAME:  2176228 ( 2.26%)    BINARY_MULTIPLY:   392274 ( 0.41%)
       STORE_ATTR:  2085338 ( 2.17%)       SETUP_EXCEPT:   285523 ( 0.30%)


Here's my current crop of ideas in a nutshell.

Implement PEP 266.  Most global variable accesses and attribute expressions
anchored by global variables (e.g. "math.sin") thus become locals as far as
the virtual machine is concerned.  The small number of STORE_GLOBAL and
STORE_ATTR instructions suggests that my "almost constant" assumption for
globals is a reasonable one.  (It would be nice to have DXPAIR profiling as
well to see how many STORE_ATTRs follow LOAD_GLOBALs.  Anybody wanting to
generate some can check ceval.c for details.  It's not hard to do; you just
need to be able to sacrifice some performance on a significant suite of
programs and instrument your code slightly to save the instruction
frequencies.  Hmmm... Maybe we could create a little dxpair project.  I
could make an XML-RPC server available to which people could dump their
results... hmm... hmm...)
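
For concreteness, the effect I'm after with PEP 266 is roughly what people
do by hand today when they hoist a global or attribute lookup into a local
(the function names here are just illustrative):

    import math

    def table(n):
        # each iteration pays for a LOAD_GLOBAL plus a LOAD_ATTR
        result = []
        for i in range(n):
            result.append(math.sin(i))
        return result

    def table_hoisted(n):
        sin = math.sin          # pay for the lookup once
        result = []
        for i in range(n):
            result.append(sin(i))       # now a cheap LOAD_FAST
        return result

The PEP's aim is to get the first function's loop to perform like the
second's automatically, with machinery to notice if the global is rebound.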

Implement a three-address virtual machine where the current fastlocals+stack
space in frame objects are treated as a single register file.  Most
instructions' sources and destinations will be local variables that will be
read or written by the instruction that uses them, not by another pass
around the interpreter's main loop to copy data onto or off of the stack.
With proper optimization of the bytecode, this is where I expect to see the
biggest reduction in data movement instructions.
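
To make that concrete, here's the stack code the current compiler emits for
a simple statement, next to a hypothetical three-address encoding of the
same thing (the register form is purely illustrative, not a real
instruction set):

    def f(a, b):
        c = a + b

    # dis.dis(f) today (under -O, ignoring the return of None):
    #     0 LOAD_FAST     0 (a)
    #     3 LOAD_FAST     1 (b)
    #     6 BINARY_ADD
    #     7 STORE_FAST    2 (c)
    #
    # hypothetical three-address form, with fastlocals treated as
    # registers r0=a, r1=b, r2=c:
    #     BINARY_ADD  r2, r0, r1

The three data movement instructions collapse into operand fields of the
one instruction that actually does some work.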

Modify frame objects so they have space for the function's constants next to
the fastlocals array.  Initialize that storage at function startup.
Function constants then look like local variables to opcodes that take their
sources from registers.  If initialization is expensive (and I doubt it will
be), you can play a few tricks, such as initializing constant slot 0 (which
is always a reference to Py_None) once per frame object, or caching frame
objects on a per function basis so you can skip constant initialization most
of the time.
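
A rough model of the frame layout I have in mind (illustrative Python, not
CPython source):

    class RegisterFrame:
        # register file laid out as [constants | locals | temporaries];
        # constants are copied in once at frame setup so register-based
        # opcodes can address them exactly like locals
        def __init__(self, code):
            self.regs = (list(code.co_consts) +
                         [None] * code.co_nlocals +
                         [None] * code.co_stacksize)

Caching one such frame per code object would let the constant section be
initialized just once, as suggested above.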

Instructions like POP_TOP, DUP_TOP, ROT_TWO, and ROT_THREE effectively
disappear.  (All except POP_TOP occur rarely, anyway.)

Breakage of this model occurs for a number of opcodes such as CALL_FUNCTION,
BUILD_TUPLE, BUILD_LIST and UNPACK_SEQUENCE.  With the exception of
CALL_FUNCTION, they all occur rarely.  I'd like to have some dynamic data
that suggest how many parameters functions take on average.  A quick look at
the static frequencies (gotten by disassembling the pyo files in the top
level of the standard library) suggests that about 75% of functions take 0
or 1 parameters.  Perhaps 5-7% of the LOADs and STOREs (roughly equal to the
number of CALL_FUNCTION instructions) would have to be preserved.  Depending
on how smart you were, you might or might not be able to elide some of the
loads or stores involved with the CALL_FUNCTION.  (I doubt I'm that smart.)
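
Gathering that static parameter-count data is easy enough.  Here's a sketch
that recursively visits the code objects in compiled modules (2.x-era code;
the library path is just an example, and note that module and class bodies
get counted as zero-argument code objects unless you filter them out):

    import glob, types

    def walk_code(co, counts):
        counts[co.co_argcount] = counts.get(co.co_argcount, 0) + 1
        for const in co.co_consts:
            if isinstance(const, types.CodeType):
                walk_code(const, counts)

    counts = {}
    for name in glob.glob('/usr/local/lib/python2.1/*.py'):
        try:
            co = compile(open(name).read(), name, 'exec')
        except SyntaxError:
            continue
        walk_code(co, counts)
    print counts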

The line in MAL's table that has me scratching my head is the one that
indicates about 8.5% of all instructions executed in his mix are LOAD_NAMEs.
My understanding is that within function bodies LOAD_NAME is only generated
in the presence of "from m import *" or "exec"; module-level and class-body
code uses it for all name lookups, but such code typically executes just
once.  Either the code MAL profiled was unusual in this respect or there's a
lot more of this lookup-busting code being executed than I thought.  Taking
a look at my dis output suggests that
LOAD_NAME actually occurs rarely.  Here are the static frequencies of the
various LOAD_* instructions:

    10915       LOAD_FAST
     6704       LOAD_CONST
     4236       LOAD_GLOBAL
     3024       LOAD_ATTR
       17       LOAD_NAME
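
Gathering such static counts is straightforward.  Here's a sketch, again
2.x-era; in this bytecode format an opcode with an argument occupies three
bytes:

    import dis, types

    def count_ops(co, counts):
        code = co.co_code
        i, n = 0, len(code)
        while i < n:
            op = ord(code[i])
            counts[dis.opname[op]] = counts.get(dis.opname[op], 0) + 1
            if op >= dis.HAVE_ARGUMENT:
                i = i + 3       # opcode plus two argument bytes
            else:
                i = i + 1
        for const in co.co_consts:
            if isinstance(const, types.CodeType):
                count_ops(const, counts)

Feed it the same code objects as the parameter-count walker above.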

MAL must have had a very hot bit of code containing LOAD_NAMEs for it to
account for over 8% of his dynamic instruction mix.  Other than that, the
most frequently occurring instructions in both his dynamic table and the
static frequencies I generated are pretty similar.  Here are the top
occurring static instructions:

  10915 LOAD_FAST (18.52%)                 3024 LOAD_ATTR (5.13%)    
   6825 POP_TOP (11.58%)                   2615 JUMP_IF_FALSE (4.44%)
   6704 LOAD_CONST (11.37%)                2430 JUMP_FORWARD (4.12%) 
   4236 LOAD_GLOBAL (7.19%)                2139 RETURN_VALUE (3.63%) 
   4235 CALL_FUNCTION (7.18%)              1753 COMPARE_OP (2.97%)   
   4218 STORE_FAST (7.16%)                  988 BINARY_ADD (1.68%)   

I'm not quite sure what to make of the relatively large number of JUMP
instructions in the static frequencies, but the more frequently they occur,
the smaller the size of the average basic block will be (which limits the
range of instructions over which many optimizations can be made).
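
You can estimate block sizes directly from the bytecode: a new basic block
starts at offset 0, at every jump target, and immediately after every jump.
A sketch, with the same era-specific format assumptions as above:

    import dis

    def block_leaders(co):
        # offsets that begin basic blocks: 0, every jump target,
        # and the instruction following every jump
        code = co.co_code
        leaders = {0: 1}
        i, n = 0, len(code)
        while i < n:
            op = ord(code[i])
            if op < dis.HAVE_ARGUMENT:
                i = i + 1
                continue
            arg = ord(code[i+1]) + 256 * ord(code[i+2])
            nexti = i + 3
            if op in dis.hasjabs:
                leaders[arg] = 1
                if nexti < n:
                    leaders[nexti] = 1
            elif op in dis.hasjrel:
                leaders[nexti + arg] = 1
                if nexti < n:
                    leaders[nexti] = 1
            i = nexti
        return leaders

Dividing a code object's instruction count by len(block_leaders(co)) gives
its average block size.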

The difference between the dynamic and static frequencies of POP_TOP is
accounted for by considering that the compiler generates two POP_TOP
instructions for each conditional jump, one at the head of each branch the
code might take, but obviously only one of them is executed for any given
branch.  (I'm not sure why the conditional branch opcodes don't take care of
popping the stack themselves.)
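
The pattern is easy to see in a disassembly.  Roughly what a 2.1-era
interpreter emits under -O for the function below (offsets approximate):

    def f(x):
        if x:
            y = 1

    # dis.dis(f):
    #     0 LOAD_FAST       0 (x)
    #     3 JUMP_IF_FALSE  10 (to 16)
    #     6 POP_TOP                      <- true branch pops the test
    #     7 LOAD_CONST      1 (1)
    #    10 STORE_FAST      1 (y)
    #    13 JUMP_FORWARD    1 (to 17)
    #    16 POP_TOP                      <- false branch pops it too
    #    17 LOAD_CONST      0 (None)
    #    20 RETURN_VALUE

Exactly one of the two POP_TOPs executes on any pass through the function,
which is why POP_TOP's static share runs well ahead of its dynamic share.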

Well, I've strayed pretty far afield.  If nothing else it should serve to
make some of my reasoning visible for others to critique.  There's obviously
no algorithm here in which I turn the crank and spit out a "90%" reduction
figure.  Still, while ambitious, I think it's a fair upper bound on what's
possible.

Anyone interested in working on any of this, let me know.  I have a crude start
on a couple pieces, but have no time to work on it enough to make much
progress.  Every time I put my brain back together after it explodes I'm
drawn away to something else.

but-then-who-isn't?-ly y'rs,

-- 
Skip Montanaro (skip at pobox.com)
http://www.mojam.com/
http://www.musi-cal.com/



