[Python-Dev] further optimising the micro-optimisations for cache locality (fwd)

Vladimir Marangozov Vladimir.Marangozov@inrialpes.fr
Fri, 28 Jul 2000 21:24:36 +0200 (CEST)


FYI.

It would be interesting to collect some feedback on these ideas for
popular platform/compiler combos. Who knows...

Forwarded message:
> From morton@dennisinter.com  Fri Jul 28 20:48:29 2000
> From: morton@dennisinter.com (Damien Morton)
> To: <Vladimir.Marangozov@inrialpes.fr>
> Subject: further optimising the micro-optimisations for cache locality
> Date: Fri, 28 Jul 2000 14:34:29 -0400
> 
> Folded version:
> 
>                 case BINARY_ADD:
>                         w = POP();
>                         v = POP();
>                         x = PyNumber_Add(v, w);
>                         goto end_binop;
> 
>                 case BINARY_MULTIPLY:
>                         w = POP();
>                         v = POP();
>                         x = PyNumber_Multiply(v, w);
>                         goto end_binop;
> 
>                 ...
>                 end_binop:
>                         Py_DECREF(v);
>                         Py_DECREF(w);
>                         PUSH(x);
>                         if (x != NULL) continue;
>                         break;
> 
> Further folded version:
> 
>                 case BINARY_ADD:
>                         f = PyNumber_Add;
>                         goto do_binop;
> 
>                 case BINARY_MULTIPLY:
>                         f = PyNumber_Multiply;
>                         goto do_binop;
> 
>                 ...
>                 do_binop:
>                         w = POP();
>                         v = POP();
>                         x = f(v, w);
>                         Py_DECREF(v);
>                         Py_DECREF(w);
>                         PUSH(x);
>                         if (x != NULL) continue;
>                         break;
> 
> Further further folded version:
> (I haven't looked at the main loop code, but can guess its structure.)
> 
> I assume 'op' is the opcode we are switching on.
> 
> // the function to call for this op
> PyObject *(*op_func[])() = { ..., PyNumber_Add, PyNumber_Multiply, ... };
> 
> // the kind of op this func describes
> unsigned char op_type[] = { ..., DO_BINOP1, DO_BINOP2, DO_UNOP, ... };
> 
> // These two tables will stay cached because of how frequently they are
> // accessed. I've used a table of pointers and a table of bytes to reduce
> // the memory required; because the tables are small, locality within and
> // between the tables isn't important. It might be an idea to force the
> // tables into contiguous memory somehow. I could also have used a table
> // of structs, but alignment would increase memory usage.
> 
> unsigned char op;
> 
> switch (op_type[op]) {
>         case DO_BINOP1:
>                 w = POP();
>                 v = POP();
>                 x = op_func[op](v, w);
>                 Py_DECREF(v);
>                 Py_DECREF(w);
>                 PUSH(x);
>                 if (x != NULL) continue;
>                 break;
>         case DO_BINOP2:
>                 w = POP();
>                 v = POP();
>                 x = op_func[op](v, w, Py_None);
>                 Py_DECREF(v);
>                 Py_DECREF(w);
>                 PUSH(x);
>                 if (x != NULL) continue;
>                 break;
>         case DO_UNOP:
>                 v = POP();
>                 x = op_func[op](v);
>                 Py_DECREF(v);
>                 PUSH(x);
>                 if (x != NULL) continue;
>                 break;
> }
> 
> ===================== original message =================================
> Guido van Rossum wrote:
> >
> > > [me, on micro-optimizing the main loop]
> >
> > Fair enough (for one particular CPU at least).
> 
> Since we're at it, it's worth mentioning another conclusion we came to
> at the time: the cache effects in the main loop are significant -- it
> is more important to keep the main loop small, so that those effects
> are minimized.
> 
> An experiment I did at the time gave some measurable speedup:
> I folded most of the UNOP & BINOP opcodes, since they differ only by the
> function they call and otherwise duplicate the opcode body. Like this:
> 
> Original:
>                 case BINARY_POWER:
>                         w = POP();
>                         v = POP();
>                         x = PyNumber_Power(v, w, Py_None);
>                         Py_DECREF(v);
>                         Py_DECREF(w);
>                         PUSH(x);
>                         if (x != NULL) continue;
>                         break;
> 
>                 case BINARY_MULTIPLY:
>                         w = POP();
>                         v = POP();
>                         x = PyNumber_Multiply(v, w);
>                         Py_DECREF(v);
>                         Py_DECREF(w);
>                         PUSH(x);
>                         if (x != NULL) continue;
>                         break;
>                 ...
> 
> 
> Folded version:
> 
>                 case BINARY_POWER:
>                         w = POP();
>                         v = POP();
>                         x = PyNumber_Power(v, w, Py_None);
>                         goto end_binop;
> 
>                 case BINARY_MULTIPLY:
>                         w = POP();
>                         v = POP();
>                         x = PyNumber_Multiply(v, w);
>                         goto end_binop;
> 
>                 ...
>                 end_binop:
>                         Py_DECREF(v);
>                         Py_DECREF(w);
>                         PUSH(x);
>                         if (x != NULL) continue;
>                         break;
> 
> 
> This reduced the code size of ceval.c, which reduced cache effects and
> gave more speed, despite the additional jumps. It possibly results in
> fewer page faults too, although this is unlikely.
> 
> Which makes me think that, if we want to do something about cache effects,
> it is probably not a bad idea to just "reorder" the bytecodes in the big
> switch by decreasing frequency (we have some stats about this -- I believe
> Skip and MAL have discussed the opcodes' frequency and the charts lie
> somewhere in the archives).  I remember Marc-Andre had done something in
> this direction and reported some perf improvements too.  Since reordering
> the opcodes doesn't really hurt, if I'm about to do something with the
> main loop, it'll be only this.
> 
> >
> > > Micro-optimization isn't worth the effort. As to the 2-byte arg limit,
> > > I think I'd be happy if the compiler raises an exception if this limit
> > > is exceeded.
> >
> > That would be a good first step indeed!
> 
> I'll try to have a look, unless someone else is more reactive than me.
> 
> --
>        Vladimir MARANGOZOV          | Vladimir.Marangozov@inrialpes.fr
> http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
> 


-- 
       Vladimir MARANGOZOV          | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252