[Python-Dev] PEP 203 Augmented Assignment

Vladimir Marangozov Vladimir.Marangozov@inrialpes.fr
Fri, 28 Jul 2000 16:04:03 +0200 (CEST)


Guido van Rossum wrote:
> 
> > [me, on micro-optimizing the main loop]
> 
> Fair enough (for one particular CPU at least).

Since we're at it, it's worth mentioning another conclusion we came across
at the time: the cache effects in the main loop are significant -- it is
more important to try keeping at best the main loop small enough, so that
those effects are minimized.

An experiment I did at the time which gave some delta-speedup:
I folded most of the UNOP & BINOP opcodes since they differ only by the
functon they call and they duplicate most of the opcode body. Like this:

Original:
                case BINARY_POWER:
                        w = POP();
                        v = POP();
                        x = PyNumber_Power(v, w, Py_None);
                        Py_DECREF(v);
                        Py_DECREF(w);
                        PUSH(x);
                        if (x != NULL) continue;
                        break;
                
                case BINARY_MULTIPLY:
                        w = POP();
                        v = POP();
                        x = PyNumber_Multiply(v, w);
                        Py_DECREF(v);
                        Py_DECREF(w);
                        PUSH(x);
                        if (x != NULL) continue;
                        break;
                ...


Folded version:

                case BINARY_POWER:
                        w = POP();
                        v = POP();
                        x = PyNumber_Power(v, w, Py_None);
			goto end_binop;
                
                case BINARY_MULTIPLY:
                        w = POP();
                        v = POP();
                        x = PyNumber_Multiply(v, w);
			goto end_binop;

                ...
		end_binop:
                        Py_DECREF(v);
                        Py_DECREF(w);
                        PUSH(x);
                        if (x != NULL) continue;
                        break;


This reduced the code size of ceval.c, which resulted in less cache effects
and gave more speed, despite the additional jumps. It possibly results in
less page-faults too, although this is unlikely.

Which makes me think that, if we want to do something about cache effects,
it is probably not a bad idea to just "reorder" the bytecodes in the big
switch by decreasing frequency (we have some stats about this -- I believe
Skip and MAL have discussed the opcodes' frequency and the charts lie
somewhere in the archives).  I remember Marc-Andre had done something in
this direction and reported some perf improvements too.  Since reordering
the opcodes doesn't really hurt, if I'm about to do something with the
main loop, it'll be only this.

> 
> > Micro-optimization doesn't worth the effort. As to the 2-byte arg limit,
> > I think I'd be happy if the compiler raises an exception if this limit
> > is exceeded.
> 
> That would be a good first step indeed!

I'll try to have a look, if someone else is not more reactive than me.

-- 
       Vladimir MARANGOZOV          | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252