[issue45261] Unreliable (?) results from timeit (cache issue?)

Steven D'Aprano report at bugs.python.org
Wed Sep 22 11:35:14 EDT 2021


Steven D'Aprano <steve+python at pearwood.info> added the comment:

Thank you Victor, I don't know whether I can convince you or you will 
convince me, but this is very interesting one way or the other. And I 
hope that Tim Holy is interested too :-)

On Wed, Sep 22, 2021 at 01:51:35PM +0000, STINNER Victor wrote:

> > But timing results are not like that, the measurement errors are 
> > one-sided, not two: (..)
> 
> I suggest you to run tons of benchmarks and look at the distribution. 
> The reality is more complex than what you may think.

We are trying to find the ideal time that the code would take with no 
external factors slowing it down. No noise, no other processes running, 
no operating system overhead. The baseline is the time it would take to 
run the code if the CPU could dedicate every cycle to your code and 
nothing else.

(Obviously we cannot measure that time on a real computer, where there 
are always other processes running, but we want the best estimate of 
that time we can reach.)

There may still be conditions that can shift the baseline, e.g. whether 
or not the data fits in the CPU cache, but that's not *environmental* 
noise except insofar as other processes might push your data out of the 
cache.

So in a noisy system (a real computer) there is some ideal time that the 
code would take to run if the CPU was dedicated to running your code 
only, and some actual time (ideal time + environmental noise) that we 
can measure.

The environmental noise is never negative. If it were, running more 
processes on the machine could make your code run faster, and that 
cannot be.
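
To make that model concrete, here is a toy sketch (the constants are 
made up, and I am assuming the noise is one-sided and exponentially 
distributed) showing that under this model the minimum of repeated 
measurements approaches the ideal time, while the mean is pulled 
upwards by the noise:

    import random

    IDEAL_TIME = 1.00    # hypothetical "true" run time, in seconds
    # noise is never negative: exponential with a mean of 0.05 s
    samples = [IDEAL_TIME + random.expovariate(1 / 0.05)
               for _ in range(1000)]

    print(f"min : {min(samples):.4f}")                 # close to 1.00
    print(f"mean: {sum(samples) / len(samples):.4f}")  # roughly 1.05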

> > measurement = true value + random error
> 
> In my experience, there is no single "true value", they are multiple 
> values. Concrete example where the Python randomized hash function 
> gives different value each time you spawn a new Python process:
>
> https://vstinner.github.io/journey-to-stable-benchmark-average.html
>
> Each process has its own "true value", but pyperf spawns 20 Python 
> processes :-)

Thank you for the link, I have read it.

Of course the more processes you spawn, the more environmental noise 
there is, which means more noise in your measurements.

If you run with 10 processes, or 100 processes, you probably won't get 
the same results for the two runs. (Sorry, I have not tried this myself. 
Consider this a prediction.)
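
As a rough sketch of the sort of experiment I mean (the statement being 
timed and the number of processes are arbitrary), one could time the 
same snippet in several fresh interpreter processes, each of which gets 
its own random hash seed, and compare the reported times:

    import subprocess
    import sys

    # arbitrary dict-building snippet; hashing of str keys is affected
    # by the per-process randomized hash seed
    SNIPPET = "d = {str(i): i for i in range(1000)}"

    for run in range(5):
        result = subprocess.run(
            [sys.executable, "-m", "timeit", SNIPPET],
            capture_output=True, text=True,
        )
        print(f"process {run}: {result.stdout.strip()}")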

You say:

"But how can we compare performances if results are random? Take the 
minimum? No! You must never (ever again) use the minimum for 
benchmarking! Compute the average and some statistics like the standard 
deviation"

but you give no reason why one should not use the minimum. Without a 
reason why the minimum is *wrong*, I cannot accept that prohibition.
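
For what it's worth, this is how I take the minimum with the standard 
library (the statement and the repeat/number counts below are 
arbitrary): run several independent repeats and report the best one, on 
the view argued above that noise only ever adds to the measured time.

    import timeit

    # arbitrary statement; take the best of several independent repeats
    times = timeit.repeat("sorted(range(1000))", number=10_000, repeat=7)
    per_call = min(times) / 10_000
    print(f"best of 7 repeats: {per_call * 1e9:.1f} ns per call")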

> Here is report is about a value lower than a single nanosecond: 
> "0.08ns/element".

An Intel i7 CPU performs at approximately 220000 MIPS, so you can 
execute about 17 or 18 CPU instructions in 0.08 ns. (Depends on the 
instruction, of course.) AMD Ryzen Threadripper may be ten times as 
fast. Without knowing the CPU and code I cannot tell whether to be 
impressed or shocked at the time.
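
Back-of-the-envelope, taking the 220000 MIPS figure at face value:

    mips = 220_000                   # million instructions per second
    per_ns = mips * 1_000_000 / 1e9  # = 220 instructions per nanosecond
    print(per_ns * 0.08)             # about 17.6 instructions in 0.08 ns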

> Again, good luck with benchmarking, it's a hard problem ;-)

Thank you, and I do appreciate all the work you have put into it, even 
if you have not convinced me to stop using the minimum.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45261>
_______________________________________

