[Python-Dev] Re: Python optimizations

Skip Montanaro skip@pobox.com
Thu, 7 Feb 2002 23:34:10 -0600


Gustavo,

Thanks for the note.  The timing is a funny coincidence: your note arrived
like a bolt out of the blue while I'm attending IPC10, where Jeremy Hylton
and I led a session this afternoon on optimization issues.

    Gustavo> Recently, I have found your paper about peephole optimization,
    Gustavo> ...

I should probably check my peephole optimizer into the nondist sandbox on
SF.  It shouldn't take very much effort.  Unlike Rattlesnake, it still
pretty much works.

    Gustavo> ... discovered that I'm not original, and repeated most of your
    Gustavo> ideas and mistakes.

I was not original either.  I'm sure I repeated the mistakes of many other
people as well.

    Gustavo> One thing I thought and also found a reference in your paper is
    Gustavo> about some instructions that should be turned into a single
    Gustavo> opcode. To understand how this would affect the code, I have
    Gustavo> disassembled the whole Python standard library, and the whole
    Gustavo> Zope library. After that I've run a script to detect repeated
    Gustavo> opcode pairs (excluding SET_LINENO).

It sounds like your measurements were made on the static bytecode.  I
suspect you might find the dynamic opcode pair frequencies interesting as
well.  That's what most people look at when deciding what really follows
what.  (For example, did you consider the basic block structure of the
bytecode or just adjacency in the disassembler output?)  You can get this
information by defining the two macros DXPAIRS and DYNAMIC_EXECUTION_PROFILE
when compiling Python.  I think all you need to recompile is ceval.c and
sysmodule.c.  Once you've done this, run your scripts as usual, then before
exit (you might just want to run your subject script with the -i flag), call
sys.getdxp().  That will return you a 257x256 list (well, a list containing
257 other lists, each of which contains 256 elements).  The value at
location [i][j] corresponds to the frequency of opcode j being executed
immediately after opcode i (or the other way around - a quick peek at the
code will tell you which is which).
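In a normal build sys.getdxp doesn't exist, but the matrix it returns is
easy to post-process once you have it.  Here is a minimal sketch of such
an analysis; the `top_pairs` helper and the synthetic demo matrix are my
own illustration, not part of the instrumented-build machinery:

```python
import opcode

def top_pairs(dxp, n=5):
    """Return the n most frequent opcode pairs from a sys.getdxp()-style
    matrix: dxp[i][j] counts opcode j executed right after opcode i
    (or the reverse; a peek at ceval.c will tell you which)."""
    pairs = []
    for i, row in enumerate(dxp):
        for j, count in enumerate(row):
            if count:
                # Row 257 (index 256) has no opcode name of its own.
                name_i = opcode.opname[i] if i < 256 else "<start>"
                pairs.append((count, name_i, opcode.opname[j]))
    pairs.sort(reverse=True)
    return pairs[:n]

# Synthetic demo matrix; a real one comes from sys.getdxp() in a build
# compiled with DYNAMIC_EXECUTION_PROFILE and DXPAIRS.
demo = [[0] * 256 for _ in range(257)]
demo[opcode.opmap["LOAD_FAST"]][opcode.opmap["LOAD_ATTR"]] = 23632
demo[opcode.opmap["LOAD_CONST"]][opcode.opmap["LOAD_CONST"]] = 15382
print(top_pairs(demo, 2))
# -> [(23632, 'LOAD_FAST', 'LOAD_ATTR'), (15382, 'LOAD_CONST', 'LOAD_CONST')]
```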

At one point I had an xmlrpc server running on manatee.mojam.com to which
people could submit such opcode frequency arrays; however, nobody submitted
anything to it, so I eventually turned it off.  I would be happy to crank it
up again.  With the atexit module and xmlrpclib in the core library, it's
dead easy to instrument a program so it automatically dumps the data to my
server upon program exit.
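The instrumentation could look something like the sketch below.  The server
URL and the `add_dx_info` method name are assumptions of mine (the original
endpoint is not specified here), and xmlrpclib is spelled xmlrpc.client in
modern Python:

```python
import atexit
import sys
import xmlrpc.client  # "xmlrpclib" in the Python of this email

# Hypothetical endpoint; the original server ran on manatee.mojam.com.
SERVER_URL = "http://example.com/dxp"

def submit_dxp():
    """Dump this process's opcode-pair counts to a collection server.
    Only does anything in builds compiled with DYNAMIC_EXECUTION_PROFILE."""
    if not hasattr(sys, "getdxp"):
        return  # normal build: nothing to report
    proxy = xmlrpc.client.ServerProxy(SERVER_URL)
    proxy.add_dx_info(sys.getdxp())  # method name is an assumption

# Runs automatically at interpreter exit, after the profiled workload.
atexit.register(submit_dxp)
```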

    Gustavo> 23632   LOAD_FAST, LOAD_ATTR

This is not all that surprising and supports Jeremy's belief (which I agree
with) that self.attr is a very common construct in the language.
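You can see the pair directly with the dis module; even in current CPython,
an attribute access on a local still compiles to LOAD_FAST followed by
LOAD_ATTR:

```python
import dis

def method(self):
    return self.attr

# List the opcode names for the method body.
ops = [ins.opname for ins in dis.get_instructions(method)]
print(ops)  # contains 'LOAD_FAST' immediately followed by 'LOAD_ATTR'
```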

    Gustavo> 15382   LOAD_CONST, LOAD_CONST

Now, this is interesting.  If those constants are numbers and the next
opcode is a BINARY_*, my peephole optimizer can elide that operation and
create a new constant, so something like

    LOAD_CONST 60
    LOAD_CONST 60
    BINARY_MULTIPLY

would get converted to simply 

    LOAD_CONST 3600
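As a toy model of that folding step (operating on symbolic (opcode, arg)
pairs rather than real bytecode, and handling only one pass, where a real
optimizer would iterate until no more folds apply):

```python
def fold_constants(code):
    """Peephole pass over a toy (opcode, arg) list: collapse
    LOAD_CONST a; LOAD_CONST b; BINARY_* into a single LOAD_CONST."""
    binops = {
        "BINARY_ADD": lambda a, b: a + b,
        "BINARY_SUBTRACT": lambda a, b: a - b,
        "BINARY_MULTIPLY": lambda a, b: a * b,
    }
    out = []
    i = 0
    while i < len(code):
        if (i + 2 < len(code)
                and code[i][0] == code[i + 1][0] == "LOAD_CONST"
                and code[i + 2][0] in binops):
            folded = binops[code[i + 2][0]](code[i][1], code[i + 1][1])
            out.append(("LOAD_CONST", folded))
            i += 3
        else:
            out.append(code[i])
            i += 1
    return out

print(fold_constants([("LOAD_CONST", 60), ("LOAD_CONST", 60),
                      ("BINARY_MULTIPLY", None)]))
# -> [('LOAD_CONST', 3600)]
```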

    Gustavo> 12842   JUMP_IF_FALSE, POP_TOP
    Gustavo> 12397   CALL_FUNCTION, POP_TOP

I don't think these can be avoided.

    Gustavo> 12121   LOAD_FAST, LOAD_FAST

While this pair occurs frequently, they are very cheap instructions.  All
you'd be saving is a trip around the opcode dispatch loop.

    Gustavo> Not by coincidence, I found in your paper references to a
    Gustavo> LOAD_FAST_ATTR opcode. Since you have probably mentioned this
    Gustavo> to others, I wouldn't like to bother everyone again by asking
    Gustavo> why it was not implemented. Could you please explain to me the
    Gustavo> reasons it was left behind?

LOAD_ATTR is a *very* expensive opcode (perhaps only second to CALL_FUNCTION
on a per-instruction basis). Jeremy measured minimums of around 500 clock
cycles and means of around 1200 clock cycles for this opcode.  In contrast,
it appears that a single trip around the opcode dispatch loop is on the
order of 50 clock cycles, so merging a LOAD_FAST/LOAD_ATTR pair into one
instruction only saves about 50 cycles.  What you want to eliminate is the
500+ cycles from the LOAD_ATTR instruction.  Jeremy and I both have ideas
about how to accomplish some of that, but it's not a trivial task.
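A rough way to feel this cost difference yourself is to time an attribute
lookup against a plain local load with timeit.  The exact ratio depends
heavily on the CPython version (modern interpreters cache attribute
lookups), so treat the numbers as indicative only:

```python
import timeit

# Define a class with one attribute, and a plain local bound to the
# same value, inside timeit's own namespace.
setup = "class C:\n    attr = 1\n\nobj = C()\nlocal = obj.attr"

t_attr = timeit.timeit("obj.attr", setup=setup, number=1_000_000)
t_local = timeit.timeit("local", setup=setup, number=1_000_000)
print(f"attribute lookup: {t_attr:.3f}s  local load: {t_local:.3f}s")
```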

I believe in most cases I got about a 5% speedup with peephole optimization.
That's nothing to sneeze at, I suppose, but there were some barriers to
adoption.  First and foremost, generating that instruction requires running
my optimizer, which isn't blindingly fast.  (Probably fast enough for a
"compileall" step that you execute once at install time, but maybe too slow
to use regularly on-the-fly.)  It also predates the compiler Jeremy
implemented in Python.  It would probably be fairly easy to hang my
optimizer off the back end of his compiler as an optional pass.  It looks
like Guido would like to see a little work put into regaining some of the
performance that was lost between 1.5.2 and 2.2, so now would probably be a
good time to dust off my optimizer.

    Gustavo> If you have the time, I'd also like to understand what's the
    Gustavo> trouble involved in getting a peephole optimizer in the python
    Gustavo> compiler itself.  Is it just about compiling performance?  I
    Gustavo> don't remember reading about this in your paper, but you
    Gustavo> probably thought about that as well.

Mostly just time.  Tick tick tick...

Skip