[issue45261] Unreliable (?) results from timeit (cache issue?)
Steven D'Aprano
report at bugs.python.org
Wed Sep 22 11:35:14 EDT 2021
Steven D'Aprano <steve+python at pearwood.info> added the comment:
Thank you Victor. I don't know whether I will convince you or you will
convince me, but this is very interesting either way. And I hope that
Tim Holy is interested too :-)
On Wed, Sep 22, 2021 at 01:51:35PM +0000, STINNER Victor wrote:
> > But timing results are not like that, the measurement errors are
> > one-sided, not two: (..)
>
> I suggest you to run tons of benchmarks and look at the distribution.
> The reality is more complex than what you may think.
We are trying to find the ideal time that the code would take with no
external factors slowing it. No noise, no other processes running, no
operating system overhead. The baseline time that it would take to run
the code if the CPU could dedicate every cycle to your code and nothing
else.
(Obviously we cannot measure that time on a real computer, where there
are always other processes running, but we want the best estimate of
that time we can reach.)
There may still be conditions that can shift the baseline, e.g. whether
the data fits in the CPU cache or not, but that's not *environmental*
noise except insofar as the impact of other processes might push your
data out of the cache.
So in a noisy system (a real computer) there is some ideal time that the
code would take to run if the CPU was dedicated to running your code
only, and some actual time (ideal time + environmental noise) that we
can measure.
The environmental noise is never negative: a negative noise term would
mean that running more processes on the machine could make your code
faster, which cannot be.
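That additive model can be sketched with a small simulation. This is an
illustration of the argument, not a real benchmark; the ideal time and
the exponential noise distribution are assumptions chosen for the
example.

```python
import random
import statistics

# Model: measurement = ideal time + non-negative environmental noise.
IDEAL = 100.0  # hypothetical "true" runtime, in microseconds

def measure():
    # Noise is one-sided: other processes can only slow us down.
    noise = random.expovariate(1 / 5.0)  # mean 5 us of interference
    return IDEAL + noise

samples = [measure() for _ in range(10_000)]

# The minimum approaches the ideal time; the mean is biased upward
# by the average amount of noise.
print(f"min  = {min(samples):.2f}")
print(f"mean = {statistics.mean(samples):.2f}")
```

With 10,000 samples the minimum lands within a tiny fraction of the
ideal time, while the mean sits roughly one mean-noise above it.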
> > measurement = true value + random error
>
> In my experience, there is no single "true value", they are multiple
> values. Concrete example where the Python randomized hash function
> gives different value each time you spawn a new Python process:
>
> https://vstinner.github.io/journey-to-stable-benchmark-average.html
>
> Each process has its own "true value", but pyperf spawns 20 Python
> processes :-)
Thank you for the link, I have read it.
Of course the more processes you spawn, the more environmental noise
there is, which means more noise in your measurements.
If you run with 10 processes, or 100 processes, you probably won't get
the same results for the two runs. (Sorry, I have not tried this myself.
Consider this a prediction.)
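The per-process "true value" effect is easy to observe directly. The
sketch below spawns two fresh interpreters and compares hash('spam');
setting PYTHONHASHSEED to "random" in the child environment forces hash
randomization even if the parent process pinned a seed. The helper name
is mine, not from the thread.

```python
import os
import subprocess
import sys

# Each Python process draws its own random hash seed, so any
# hash-dependent code genuinely has a per-process baseline.
def spawned_hash(s: str) -> int:
    env = {**os.environ, "PYTHONHASHSEED": "random"}
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({s!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

print(spawned_hash("spam"), spawned_hash("spam"))  # almost surely different
```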
You say:
"But how can we compare performances if results are random? Take the
minimum? No! You must never (ever again) use the minimum for
benchmarking! Compute the average and some statistics like the standard
deviation"
but you give no reason why one should not use the minimum. Without a
reason why the minimum is *wrong*, I cannot accept that prohibition.
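For what it's worth, here is what the two statistics look like over an
actual timeit.repeat() run. This is only a sketch of the comparison
being argued about; the statement being timed and the repeat counts are
arbitrary choices of mine.

```python
import statistics
import timeit

# Time the same statement several times; under the one-sided noise
# model, the minimum is a lower bound on the per-loop cost, and
# higher runs reflect interference rather than the code itself.
runs = timeit.repeat("sum(range(1000))", repeat=7, number=10_000)
per_loop = [t / 10_000 for t in runs]

print(f"min  = {min(per_loop):.3e} s/loop")
print(f"mean = {statistics.mean(per_loop):.3e} s/loop")
```

The mean can never fall below the minimum, so the question is only
whether the gap between them is signal or noise.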
> Here the report is about a value lower than a single nanosecond:
> "0.08ns/element".
An Intel i7 CPU performs at approximately 220000 MIPS, so you can
execute about 17 or 18 CPU instructions in 0.08 ns. (Depends on the
instruction, of course.) AMD Ryzen Threadripper may be ten times as
fast. Without knowing the CPU and code I cannot tell whether to be
impressed or shocked at the time.
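The back-of-envelope arithmetic behind that figure, taking the rough
~220,000 MIPS throughput quoted above as an assumption rather than a
measurement:

```python
# Instructions executable in the quoted 0.08 ns window, assuming a
# CPU that retires about 220,000 million instructions per second.
mips = 220_000                        # million instructions per second
instructions_per_second = mips * 1e6
window = 0.08e-9                      # 0.08 ns, from the quoted report
print(instructions_per_second * window)  # ~17.6 instructions
```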
> Again, good luck with benchmarking, it's a hard problem ;-)
Thank you, and I do appreciate all the work you have put into it, even
if you have not convinced me to stop using the minimum.
----------
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45261>
_______________________________________