[issue45261] Unreliable (?) results from timeit (cache issue?)

Steven D'Aprano report at bugs.python.org
Wed Sep 22 06:46:30 EDT 2021


Steven D'Aprano <steve+python at pearwood.info> added the comment:

Thanks Victor for the explanation about pyperf's additional features. 
They do sound very useful. Perhaps we should consider adding some of 
them to timeit?

However, in my opinion using the average is statistically wrong. Using 
the mean is good when errors are two-sided, that is, your measured 
value can be either too low or too high compared to the true value:

    measurement = true value ± random error

If the random errors are symmetrically distributed, then taking the 
average tends to cancel them out and give you a better estimate of the 
true value. Even if the errors aren't symmetrical, the mean will still 
be a better estimate of the true value. (Or perhaps a trimmed mean, or 
the median, if there are a lot of outliers.)

But timing results are not like that: the measurement errors are 
one-sided, not two-sided:

    measurement = true value + random error

So by taking the average, all you are doing is averaging the errors, not 
cancelling them. The result you get is *worse* as an estimate of the 
true value than the minimum.
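The effect is easy to see in a small simulation (mine, not from the 
original report): model each timing as an assumed true cost plus a 
strictly positive random delay, then compare the mean and the minimum 
as estimates of that cost.

```python
import random

random.seed(42)

true_cost = 1.0  # assumed "true" run time, in arbitrary units

# One-sided errors: system noise can only ever slow a run down,
# so every sample is true_cost plus a positive random delay.
samples = [true_cost + random.expovariate(5.0) for _ in range(1000)]

mean_estimate = sum(samples) / len(samples)
min_estimate = min(samples)

print(f"mean: {mean_estimate:.4f}")  # biased upward by the average delay
print(f"min:  {min_estimate:.4f}")   # hugs the true cost from above
```

The mean lands roughly one average delay above the true cost, while the 
minimum of many runs sits just above it.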

All those other factors (ignore the warmup, check for a small stdev, 
etc) seem good to me. But the minimum, not the mean, is still going to 
be closer to the true cost of running the code.
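In practice that is what timeit's own repeat() API already supports: 
run the statement several times and report the minimum of the runs. A 
minimal sketch (statement and counts are just illustrative):

```python
import timeit

# Run the statement repeat=5 separate times, number=10_000 loops each,
# and take the fastest run as the estimate of the true cost.
runs = timeit.repeat("sorted(range(100))", number=10_000, repeat=5)
best = min(runs)
print(f"best of {len(runs)} runs: {best:.4f} s")
```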

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45261>
_______________________________________