Building CPython

BartC bc at freeuk.com
Sun May 17 09:41:10 EDT 2015


On 17/05/2015 13:25, Jonas Wielicki wrote:
> On 16.05.2015 02:55, Gregory Ewing wrote:
>> BartC wrote:
>>> For example, there is a /specific/ byte-code called BINARY_ADD, which
>>> then proceeds to call a /generic/ binary-op handler! This throws away
>>> the advantage of knowing at byte-code generation time exactly which
>>> operation is needed.
>>
>> While inlining the binary-op handling might give you a
>> slightly shorter code path, it wouldn't necessarily speed
>> anything up. It's possible, for example, that the shared
>> binary-op handler fits in the instruction cache, but the
>> various inlined copies of it don't, leading to a slowdown.
>>
>> The only way to be sure about things like that is to try
>> them and measure. The days when you could predict the speed
>> of a program just by counting the number of instructions
>> executed are long gone.
>
> That, and also, the days where you could guess the number of
> instructions executed from looking at the code are also gone. Compilers,
> and especially C or C++ compilers, are huge beasts with an insane number
> of different optimizations which yield pretty impressive results. Not to
> mention that they may know the architecture you’re targeting and can
> optimize each build for a different architecture; which is not really
> possible if you do optimizations which e.g. rely on cache
> characteristics or instruction timings or interactions by hand.
>
> I changed my habits to just trust my compiler a few years ago and have
> more readable code in exchange for that. The compiler does a fairly
> great job, although gcc still outruns clang for *my* usecases.

> YMMV.

It does. For my interpreter projects, gcc -O3 does a pretty good job.

Running a suite of standard benchmarks ('spectral', 'fannkuch', 
'binary-trees', all that lot) in the bytecode language under test, 
gcc is 30% faster than my own language/compiler. (And 25% faster than 
clang.)

(In that project, gcc can do a lot of inlining, which doesn't seem 
practical in CPython, where the handler functions are spread all over 
the place.)

However, when I plug in an ASM dispatcher to my version (which tries to 
deal with simple bytecodes and some common object types itself before 
passing control back to the HLL handlers), I can get /twice as fast/ as 
gcc -O3. (For real programs the difference is narrower, but it's 
usually still faster than gcc.)

(I don't think this approach will work with CPython, because there 
don't appear to be any simple cases for the ASM to deal with! The ASM 
dispatcher keeps essential globals such as the stack pointer and 
program counter in registers, and uses chained 'threaded' code rather 
than function calls. A decent proportion of byte-codes needs to be 
handled entirely within this environment, otherwise it could actually 
slow things down, as the switch to/from HLL code is expensive.)

-- 
Bartc
