Python and the need for speed

Erik python at lucidity.plus.com
Tue Apr 18 20:07:44 EDT 2017


On 19/04/17 00:33, bartc wrote:
> So that's 'label-pointers' which I assume must correspond to computed
> goto.

Yes - just different terminology. Being able to take the address of a 
label and "goto address" rather than "goto label".

> (I don't know why they should be faster than a switch; they just
> are.)

In C, the code generated for a switch() can do almost anything. It might 
cache the expression in a temporary and generate a linear "if .. else if 
.. else if" sequence, which is probably quite common for a sparsely 
populated set of values (after an initial range check). Or it might 
generate a jump table, much like the computed gotos, which is probably 
quite common for a densely populated set of values (after an initial 
range check and an adjustment to make the index zero-based). It could 
also generate some hybrid of the two: for example, a switch with several 
densely-populated areas in an otherwise sparse set of values might do 
linear range checks to find the right jump table to use for each dense 
area.
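
Roughly, the difference in shape looks like this (my own sketch, not 
output from any particular compiler - which strategy you actually get is 
entirely up to the optimizer):

/* Dense case values: a good candidate for a single jump table. */
int dense(int op)
{
    switch (op) {
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    case 3: return 40;
    default: return -1;        /* range check still needed */
    }
}

/* Scattered case values: more likely to become a chain of compares. */
int sparse(int op)
{
    switch (op) {
    case 7:    return 10;
    case 900:  return 20;
    case 5000: return 30;
    default:   return -1;
    }
}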

What the computed goto stuff does is effectively reduce all of that to a 
single jump table with a known set of indices - there are 256 opcodes 
starting from 0 and each has an entry in an array of code pointers. To 
jump to the handler for an opcode, just 'goto *handlers[op]'. No range 
checking, no index adjustment, no linear test sequences. This is what 
makes the dispatch fast.
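
For concreteness, here is a minimal sketch of that kind of dispatch loop 
using GCC's labels-as-values extension ('&&label' takes the address of a 
label, 'goto *expr' jumps to it). This is just an illustration of the 
idea, not CPython's actual code:

#include <stdio.h>

void run(const unsigned char *code)
{
    /* One slot per opcode; a real interpreter fills all 256 entries. */
    static void *handlers[256] = { [0] = &&op_print, [1] = &&op_halt };
    const unsigned char *ip = code;

    goto *handlers[*ip++];        /* initial dispatch */

op_print:
    printf("print opcode\n");
    goto *handlers[*ip++];        /* no range check, no test chain */

op_halt:
    return;
}

int main(void)
{
    const unsigned char prog[] = { 0, 0, 1 };   /* print, print, halt */
    run(prog);
    return 0;
}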

> With the sort of lower level programs I write (in another dynamic
> language not Python), such an assembly layer improved performance 2-3
> times over using 100% HLL compiled using C and gcc-O3.

Did you give the C compiler enough hints, though? Much like the 
computed-goto stuff above (which is less of a hint and more of an 
instruction that the compiler should use a jump table and nothing else 
for that bit of code), there are lots of other ways of spelling things 
differently in C that can give better performance than what you would 
get by default (and I'm not even talking about compiler-specific 
#pragmas or whatever).
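
Two such spellings, for example (again, just my own illustrations, not 
anything from your code - 'restrict' is standard C99, __builtin_expect() 
is a GCC/Clang extension):

#include <stddef.h>

/* 'restrict' promises the pointers don't alias, which can let the
 * optimizer vectorise a loop it otherwise couldn't prove safe. */
void scale(double * restrict dst, const double * restrict src,
           double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* __builtin_expect tells the compiler which branch is hot, so the
 * unlikely error path can be moved out of the fast path. */
int checked_div(int a, int b, int *out)
{
    if (__builtin_expect(b == 0, 0))
        return -1;                 /* rare error path */
    *out = a / b;
    return 0;
}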

Also, remember that -O3 might (and by that I mean probably will! ;)) 
make your code larger. If some specific core areas of your interpreter 
are now large enough to cause instruction cache misses, then a smaller 
-O2 (or even -Os) compiled version might perform better on your 
hardware.
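
If you want to experiment without rebuilding everything at a different 
level, GCC also has a per-function 'optimize' attribute. It's 
GCC-specific and the manual really only recommends it for 
experimentation, but it's a quick way to test the theory on one hot 
function (the function below is just a placeholder, not your code):

__attribute__((optimize("Os")))
long hot_inner_loop(const unsigned char *code, long n)
{
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += code[i];           /* stand-in for real interpreter work */
    return acc;
}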

E.


