Python and the need for speed

Tue Apr 18 06:30:35 EDT 2017

On 18/04/2017 10:32, Erik wrote:

> FWIW, I spent some time about a year ago looking at things like this
> (small improvements to the peephole optimizer which allowed certain very
> common sequences to be folded into a (new) opcode which in turn allowed
> other optimizations to avoid branching). The changes worked, but didn't
> actually improve performance significantly in my tests (which is why I
> ended up not bothering to propose anything).
>
> I remember back in the day (circa 1.5.2?) that
> trips-around-the-interpreter-loop were significant and avoiding them
> could give wins. However, in the current CPython interpreter, the
> improvements over the original huge switch() to dispatch the bytecodes
> to the correct handler appear to have made this type of optimization
> less effective.

What did they do to it, and on which version? I've just downloaded 3.6.1 
and it was about 17% faster than 3.4.3 on running that test I posted a 
few days ago.

This was on Windows.

But I also looked at CPython a couple of years back with a view to doing 
something with it. I didn't manage anything, but learned that the 
gcc-compiled Linux version enabled label-pointers in place of the big 
switch, while the MSVC-compiled Windows version still used the switch.

The differences (on small integer benchmarks) were also 15% or so.

(I couldn't get anywhere with CPython because:

(1) It's so complicated that I could only build on Linux

(2) It depends on configure scripts and makefiles that I have no idea about

(3) None of my tools (editors etc) work on Linux so it was awkward 
(although cross-OS working using a real Windows and a virtual Linux 
would have been viable I suppose.)

(4) But the project was just too complicated to find my way around. It 
would have taken ages (the source is full of conditional blocks of code, 
and uses macros everywhere).

The thing I had in mind, that I've used elsewhere, was to add an extra 
layer to the byte-code dispatcher, written in assembly in its own module.

This has a dispatch loop (using 'threaded' code and with 
register-resident interpreter variables) which intercepts each 
byte-code, looks at it, and passes it on to the standard dispatcher if 
it can't deal with it.

So initially it would slow it down. But then you gradually add local 
handling of simple byte-codes, and eventually it gets faster than 
running the 100% HLL version.

But another problem was, there /are/ no simple byte-codes in CPython! Or 
very few (jump-absolute for example). Another optimisation for something 
like ADD, was to to check for very common types such as INT, and deal 
with those locally. But the handling required in the C code even for 
that looked horrendous.

You need to be able to able to handle such cases in a dozen lines of 
assembly, otherwise the code gets impractical.

If that had worked, then further optimisations are possible, such as 
doing a pre-pass combining common operations, that would not be 
worthwhile using 'official' byte-codes.)

-- 
bartc