[Python-Dev] Proposal for a common benchmark suite

Mark Shannon marks at dcs.gla.ac.uk
Fri Apr 29 11:03:34 CEST 2011


Maciej Fijalkowski wrote:
> On Thu, Apr 28, 2011 at 11:10 PM, Stefan Behnel <stefan_ml at behnel.de> wrote:
>> M.-A. Lemburg, 28.04.2011 22:23:
>>> Stefan Behnel wrote:
>>>> DasIch, 28.04.2011 20:55:
>>>>> the CPython
>>>>> benchmarks have an extensive set of microbenchmarks in the pybench
>>>>> package
>>>> Try not to care too much about pybench. There is some value in it, but
>>>> some of its microbenchmarks are also tied to CPython's interpreter
>>>> behaviour. For example, the benchmarks for literals can easily be
>>>> considered dead code by other Python implementations so that they may
>>>> end up optimising the benchmarked code away completely, or at least
>>>> partially. That makes a comparison of the results somewhat pointless.
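
To make the point concrete, the literal tests boil down to loops over
constant expressions, roughly along these lines (a simplified sketch,
not the actual pybench source):

    def test_literal_arithmetic(rounds):
        # Constant expressions whose results are discarded; a folding
        # or specialising implementation can treat the whole loop body
        # as dead code and skip it.
        for _ in range(rounds):
            2 + 3
            4 * 5 - 6
            (7 + 8) * 9
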
>>> The point of the micro benchmarks in pybench is to be able to compare
>>> them one-by-one, not by looking at the sum of the tests.
>>>
>>> If one implementation optimizes away some parts, then the comparison
>>> will show this fact very clearly - and that's the whole point.
>>>
>>> Taking the sum of the micro benchmarks only has some meaning
>>> as very rough indicator of improvement. That's why I wrote pybench:
>>> to get a better, more detailed picture of what's happening,
>>> rather than trying to find some way of measuring "average"
>>> use.
>>>
>>> This "average" is very different depending on where you look:
>>> for some applications method calls may be very important,
>>> for others, arithmetic operations, and yet others may have more
>>> need for fast attribute lookup.
>> I wasn't talking about "averages" or "sums", and I also wasn't trying to put
>> down pybench in general. As it stands, it makes sense as a benchmark for
>> CPython.
>>
>> However, I'm arguing that a substantial part of it does not make sense as a
>> benchmark for PyPy and others. With Cython, I couldn't get some of the
>> literal arithmetic benchmarks to run at all. The runner script simply bails
>> out with an error when the benchmarks accidentally run faster than the
>> initial empty loop. I imagine that PyPy would eventually even drop the loop
>> itself, thus leaving nothing to compare. Does that tell us that PyPy is
>> faster than Cython for arithmetic? I don't think it does.
>>
>> When I see that a benchmark shows that one implementation runs in 100% less
>> time than another, I simply go *shrug* and look for a better benchmark to
>> compare the two.
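
The failure mode Stefan describes falls out of the way pybench-style
runners subtract the timing of an empty calibration loop. A hedged
sketch of the idea (illustrative names, not pybench's actual API):

    import time

    def measure(func, rounds):
        t0 = time.time()
        func(rounds)
        return time.time() - t0

    def empty_loop(rounds):
        for _ in range(rounds):
            pass

    def literal_test(rounds):
        for _ in range(rounds):
            1 + 2 + 3          # may be folded away entirely

    rounds = 1000000
    overhead = measure(empty_loop, rounds)
    raw = measure(literal_test, rounds)
    # Can come out as zero or negative when the loop body is optimised
    # away, at which point the runner has nothing sensible to report.
    net = raw - overhead
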
> 
> I second what Stefan says here. This sort of benchmark might be
> useful for CPython, but it's not particularly useful for PyPy, for
> comparisons, or for any other implementation that tries harder to
> optimize stuff away. For example, a method call in PyPy would be
> inlined and completely removed if the method is empty, so the
> benchmark does not measure method call overhead at all. That's why
> we settled on medium-to-large examples, which average over many
> possible scenarios rather than isolating just one.
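
The method-call case can be illustrated with a hypothetical
microbenchmark like the following (not taken from any suite); a
tracing JIT can inline the empty method and then remove the call, so
the loop ends up measuring nothing:

    class C(object):
        def method(self):
            pass               # empty body

    def bench_method_call(rounds):
        obj = C()
        for _ in range(rounds):
            # After inlining, the call disappears and the loop may be
            # reduced to a plain counter, so no call overhead remains
            # to measure.
            obj.method()
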

If CPython were to start incorporating any specialising optimisations,
pybench wouldn't be of much use even for CPython.
The Unladen Swallow folks didn't like pybench as a benchmark.

