[Python-Dev] Reordering opcodes (PEP 203 Augmented Assignment)

M.-A. Lemburg mal@lemburg.com
Fri, 28 Jul 2000 16:59:57 +0200


Guido van Rossum wrote:
> 
> > Since we're at it, it's worth mentioning another conclusion we came across
> > at the time: the cache effects in the main loop are significant -- it is
> > more important to try to keep the main loop small enough so that
> > those effects are minimized.
> 
> Yes, that's what Tim keeps hammering on too.

+1 from here ;-)

I have done quite a bit of testing with the old 1.5 ceval loop
(the patch is still available in case someone wants to look at it)
and found the following:

1. It pays off to special-case LOAD_FAST by handling it before
   entering the switch at all, since it is the most frequently
   used opcode in Python (see the sketch after this list).

2. Reordering the opcodes also has a noticeable effect on
   performance, even though the result is very sensitive to how
   the code falls on cache lines.

3. Splitting the big switch in two, with the first one holding
   the most frequently used opcodes and the second one taking
   care of the more esoteric ones, helps keep the inner loop in
   the CPU cache and thus increases performance.
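To give an idea of what points 1 and 3 look like, here is a toy
stack machine with the same dispatch structure (a sketch only --
the opcode names just mimic Python's, none of this is the actual
ceval.c patch):

/* Toy stack machine, just to show the dispatch structure: the most
   frequent opcode (LOAD_FAST) is peeled off before the switch, the
   frequent opcodes go into a first switch, and the rare ones hide in
   a second switch behind "default". */
#include <stdio.h>

enum { LOAD_FAST, LOAD_CONST, BINARY_ADD, PRINT_TOP, STOP };

int main(void)
{
    /* program: push local[0], push constant 40, add, print, stop */
    int code[]   = { LOAD_FAST, 0, LOAD_CONST, 40, BINARY_ADD,
                     PRINT_TOP, STOP };
    int locals[] = { 2 };
    int stack[16], *sp = stack, *pc = code;

    for (;;) {
        int opcode = *pc++;

        /* point 1: special-case the most frequent opcode before
           entering the switch */
        if (opcode == LOAD_FAST) {
            *sp++ = locals[*pc++];
            continue;
        }

        switch (opcode) {        /* point 3: first switch, hot opcodes */
        case LOAD_CONST:
            *sp++ = *pc++;
            continue;
        case BINARY_ADD:
            sp--;
            sp[-1] = sp[-1] + sp[0];
            continue;
        default:
            switch (opcode) {    /* point 3: second switch, rare opcodes */
            case PRINT_TOP:
                printf("%d\n", sp[-1]);   /* prints 42 */
                continue;
            case STOP:
                return 0;
            default:
                fprintf(stderr, "unknown opcode %d\n", opcode);
                return 1;
            }
        }
    }
}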

From an old mail:
"""
Here is a little statistic that I've run
over a total of more than 100M opcodes:

Opcode frequencies:
------------------------------------------------------------------------
           LOAD_FAST(124) :   19323126 ================================
          SET_LINENO(127) :   15055591 ========================
          LOAD_CONST(100) :    9254683 ===============
           LOAD_NAME(101) :    8218954 =============
         LOAD_GLOBAL(116) :    7174876 ===========
          STORE_FAST(125) :    5927769 =========
             POP_TOP(  1) :    5587424 =========
       CALL_FUNCTION(131) :    5404709 ========
       JUMP_IF_FALSE(111) :    5289262 ========
          COMPARE_OP(106) :    4495179 =======
           LOAD_ATTR(105) :    3481878 =====
          BINARY_ADD( 23) :    3420811 =====
        RETURN_VALUE( 83) :    2221212 ===
          STORE_NAME( 90) :    2176228 ===
          STORE_ATTR( 95) :    2085338 ===
       BINARY_SUBSCR( 25) :    1834612 ===
       JUMP_ABSOLUTE(113) :    1648327 ==
        STORE_SUBSCR( 60) :    1446307 ==
        JUMP_FORWARD(110) :    1014821 =
     BINARY_SUBTRACT( 24) :     910085 =
           POP_BLOCK( 87) :     806160 =
        STORE_GLOBAL( 97) :     779880 =
            FOR_LOOP(114) :     735245 =
          SETUP_LOOP(120) :     657432 =
       BINARY_MODULO( 22) :     610121 =
                  32( 32) :     530811 
                  31( 31) :     530657 
     BINARY_MULTIPLY( 20) :     392274 
        SETUP_EXCEPT(121) :     285523 
...
"""

These stats were created using an instrumented Python interpreter
running real applications (and recording all executed opcodes
to a file), so they can be considered close enough to what
Python typically does in its inner loop.
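The instrumentation itself is nothing fancy -- essentially a counter
array indexed by opcode that gets bumped in the dispatch loop and
dumped to a file at exit. A stand-alone sketch of the idea (the names
and the fake trace are made up, this is not the actual tool):

#include <stdio.h>

/* Per-opcode hit counters; 256 slots since opcodes are one byte. */
static unsigned long opcode_counts[256];

/* Called right after the opcode fetch in the dispatch loop. */
static void count_opcode(int opcode)
{
    opcode_counts[opcode & 0xff]++;
}

/* Called once at interpreter shutdown. */
static void dump_opcode_counts(const char *filename)
{
    FILE *fp = fopen(filename, "w");
    int i;

    if (fp == NULL)
        return;
    for (i = 0; i < 256; i++)
        if (opcode_counts[i] != 0)
            fprintf(fp, "%3d %12lu\n", i, opcode_counts[i]);
    fclose(fp);
}

int main(void)
{
    /* fake trace so the sketch runs stand-alone; same numbering as in
       the table above: 124 = LOAD_FAST, 100 = LOAD_CONST,
       23 = BINARY_ADD, 83 = RETURN_VALUE */
    int trace[] = { 124, 124, 100, 23, 83 };
    int i;

    for (i = 0; i < (int)(sizeof(trace) / sizeof(trace[0])); i++)
        count_opcode(trace[i]);
    dump_opcode_counts("opcode_counts.txt");
    return 0;
}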
 
> > An experiment I did at the time which gave some delta-speedup:
> > I folded most of the UNOP & BINOP opcodes since they differ only by the
> > function they call and they duplicate most of the opcode body. Like this:
> [...]
> > This reduced the code size of ceval.c, which in turn reduced the cache
> > effects and gave more speed, despite the additional jumps. It possibly
> > results in fewer page faults too, although this is unlikely.
> 
> I expect this is wholly attributable to the reduced code size.  Most
> binary operators aren't used frequently enough to make a difference in
> other ways.  If you put the common code at the end of the code for
> binary '+', that would optimize the most common operator.
> 
> > Which makes me think that, if we want to do something about cache effects,
> > it is probably not a bad idea to just "reorder" the bytecodes in the big
> > switch by decreasing frequency (we have some stats about this -- I believe
> > Skip and MAL have discussed the opcodes' frequency and the charts lie
> > somewhere in the archives).  I remember Marc-Andre had done something in
> > this direction and reported some perf improvements too.  Since reordering
> > the opcodes doesn't really hurt, if I'm about to do something with the
> > main loop, it'll be only this.
> 
> Go for it -- sounds good!
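BTW, the UNOP/BINOP folding Vladimir describes boils down to one
shared opcode body that dispatches through a small table of
functions. A toy illustration (not his patch; the names and numbers
are made up):

#include <stdio.h>

enum { BINARY_ADD, BINARY_SUBTRACT, BINARY_MULTIPLY };

static int add(int a, int b) { return a + b; }
static int sub(int a, int b) { return a - b; }
static int mul(int a, int b) { return a * b; }

/* one table slot per folded opcode */
static int (*binop_table[])(int, int) = { add, sub, mul };

int main(void)
{
    int stack[16] = { 6, 7 };
    int *sp = stack + 2;
    int opcode = BINARY_MULTIPLY;

    /* the shared opcode body: pop two operands, call the function
       selected by the opcode, push the result */
    sp--;
    sp[-1] = binop_table[opcode](sp[-1], sp[0]);

    printf("%d\n", sp[-1]);   /* prints 42 */
    return 0;
}

As Guido says, the win comes from the smaller switch body; the
indirect call itself only adds an extra jump.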

If you need help, I can dig up those old tools and patches...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/