[Python-Dev] further optimising the micro-optimisations for cache locality (fwd)
Vladimir Marangozov
Vladimir.Marangozov@inrialpes.fr
Fri, 28 Jul 2000 21:24:36 +0200 (CEST)
FYI.
It would be interesting to collect some feedback on these ideas for
popular combos. Who knows...
Forwarded message:
> From morton@dennisinter.com Fri Jul 28 20:48:29 2000
> From: morton@dennisinter.com (Damien Morton)
> To: <Vladimir.Marangozov@inrialpes.fr>
> Subject: further optimising the micro-optimisations for cache locality
> Date: Fri, 28 Jul 2000 14:34:29 -0400
>
> Folded version:
>
> case BINARY_ADD:
> w = POP();
> v = POP();
> x = PyNumber_Add(v, w);
> goto end_binop;
>
> case BINARY_MULTIPLY:
> w = POP();
> v = POP();
> x = PyNumber_Multiply(v, w);
> goto end_binop;
>
> ...
> end_binop:
> Py_DECREF(v);
> Py_DECREF(w);
> PUSH(x);
> if (x != NULL) continue;
> break;
>
> Further folded version:
>
>
>
> case BINARY_ADD:
> f = PyNumber_Add;
> goto do_binop;
>
> case BINARY_MULTIPLY:
> f = PyNumber_Multiply;
> goto do_binop;
>
> ...
> do_binop:
> w = POP();
> v = POP();
> x = f(v, w);
> Py_DECREF(v);
> Py_DECREF(w);
> PUSH(x);
> if (x != NULL) continue;
> break;
>
> Further Further folded version:
> (I haven't looked at the main loop code, but can guess its structure)
>
> I assume 'op' is the opcode we are switching on
>
> // the function to call for this op
> PyObject *(*op_func[])() = { ..., PyNumber_Add, PyNumber_Multiply, ... };
>
> // the kind of op this func describes
> unsigned char op_type[] = { ..., DO_BINOP1, DO_BINOP2, DO_UNOP, ... };
>
> // these two tables will be cached because of the frequency they are accessed
> // I've used a table of pointers and a table of bytes to reduce the memory
> // required
> // because the tables are small, locality within and between the tables isn't
> // important
> // might be an idea to force the tables into contiguous memory somehow
> // I could also have used a table of structs, but alignment would increase
> // memory usage
>
> unsigned char op;
>
> switch(op_type[op]) {
> case DO_BINOP1:
> w = POP();
> v = POP();
> x = op_func[op](v, w);
> Py_DECREF(v);
> Py_DECREF(w);
> PUSH(x);
> if (x != NULL) continue;
> break;
> case DO_BINOP2:
> w = POP();
> v = POP();
> x = op_func[op](v, w, Py_None);
> Py_DECREF(v);
> Py_DECREF(w);
> PUSH(x);
> if (x != NULL) continue;
> break;
> case DO_UNOP:
> v = POP();
> x = op_func[op](v);
> Py_DECREF(v);
> PUSH(x);
> if (x != NULL) continue;
> break;
>
> ===================== original message =================================
> Guido van Rossum wrote:
> >
> > > [me, on micro-optimizing the main loop]
> >
> > Fair enough (for one particular CPU at least).
>
> Since we're at it, it's worth mentioning another conclusion we came across
> at the time: the cache effects in the main loop are significant -- it is
> more important to keep the main loop small enough that those effects are
> minimized.
>
> An experiment I did at the time which gave some delta-speedup:
> I folded most of the UNOP & BINOP opcodes since they differ only by the
> function they call and they duplicate most of the opcode body. Like this:
>
> Original:
> case BINARY_POWER:
> w = POP();
> v = POP();
> x = PyNumber_Power(v, w, Py_None);
> Py_DECREF(v);
> Py_DECREF(w);
> PUSH(x);
> if (x != NULL) continue;
> break;
>
> case BINARY_MULTIPLY:
> w = POP();
> v = POP();
> x = PyNumber_Multiply(v, w);
> Py_DECREF(v);
> Py_DECREF(w);
> PUSH(x);
> if (x != NULL) continue;
> break;
> ...
>
>
> Folded version:
>
> case BINARY_POWER:
> w = POP();
> v = POP();
> x = PyNumber_Power(v, w, Py_None);
> goto end_binop;
>
> case BINARY_MULTIPLY:
> w = POP();
> v = POP();
> x = PyNumber_Multiply(v, w);
> goto end_binop;
>
> ...
> end_binop:
> Py_DECREF(v);
> Py_DECREF(w);
> PUSH(x);
> if (x != NULL) continue;
> break;
>
>
> This reduced the code size of ceval.c, which resulted in fewer cache misses
> and gave more speed, despite the additional jumps. It possibly results in
> fewer page faults too, although this is unlikely.
>
> Which makes me think that, if we want to do something about cache effects,
> it is probably not a bad idea to just "reorder" the bytecodes in the big
> switch by decreasing frequency (we have some stats about this -- I believe
> Skip and MAL have discussed the opcodes' frequency and the charts lie
> somewhere in the archives). I remember Marc-Andre had done something in
> this direction and reported some perf improvements too. Since reordering
> the opcodes doesn't really hurt, if I'm about to do something with the
> main loop, it'll be only this.
>
> >
> > > Micro-optimization isn't worth the effort. As to the 2-byte arg limit,
> > > I think I'd be happy if the compiler raises an exception if this limit
> > > is exceeded.
> >
> > That would be a good first step indeed!
>
> I'll try to have a look, if someone else is not more reactive than me.
>
> --
> Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
> http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
>
--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252