[Speed] New CPython benchmark suite based on perf

Antoine Pitrou solipsis at pitrou.net
Tue Jul 5 04:08:52 EDT 2016


On Mon, 4 Jul 2016 22:51:11 +0200
Victor Stinner <victor.stinner at gmail.com>
wrote:
> 2016-07-04 19:49 GMT+02:00 Antoine Pitrou <solipsis at pitrou.net>:
> >>    Median +- Std dev: 256 ms +- 3 ms -> 262 ms +- 4 ms: 1.03x slower
> >
> > That doesn't sound like a terrific idea. Why do you think the median
> > gives a more interesting figure here?
> 
> When the distribution is uniform, mean and median are the same. In my
> experience with Python benchmarks, the curve is usually skewed: the
> right tail is much longer.
> 
> When the system noise is high, the skewness is much larger. In this
> case, median looks "more correct".

It "looks" more correct?

Let's say your Python implementation has a flaw: it is almost always
fast, but one run in ten is 3x slower.  Taking the mean will reflect
that occasional slowness; taking the median will completely hide it.
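
A quick illustration with made-up numbers (the timings and the 3x
factor are hypothetical):

    from statistics import mean, median

    # 9 fast runs and 1 run that is 3x slower
    timings = [0.10] * 9 + [0.30]

    print(mean(timings))    # ~0.12 -> the slow run shows up
    print(median(timings))  # 0.1   -> the slow run is invisible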

Then of course, since you have several processes and several runs per
process, you could try something more convoluted, such as
mean-of-medians or mean-of-mins or...
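
For instance, with hypothetical per-process timings (the numbers below
are invented), those aggregates would be computed as:

    from statistics import mean, median

    # one list of run timings per worker process (invented numbers)
    runs_per_process = [
        [0.10, 0.11, 0.30],
        [0.10, 0.12, 0.10],
        [0.11, 0.10, 0.13],
    ]

    mean_of_medians = mean(median(runs) for runs in runs_per_process)
    mean_of_mins = mean(min(runs) for runs in runs_per_process)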

However, if you're concerned about system noise, there may be other
ways to avoid it. For example, measure both CPU time and wall time,
and if CPU time < 0.9 * wall time (for example), discard the number
and take another measurement.

(this assumes all benchmarks are CPU-bound - which they should be here
- and single-threaded - which they *probably* are, except in a
hypothetical parallelizing Python implementation ;-)))
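
A minimal sketch of that filtering idea (the function name, the 0.9
threshold and the retry limit are all placeholders of my own):

    import time

    def measure(func, cpu_ratio=0.9, max_retries=5):
        # Reject a measurement when the process spent a noticeable
        # fraction of the interval off-CPU, i.e. when system noise
        # is likely to have inflated the wall-time figure.
        for _ in range(max_retries):
            wall0 = time.perf_counter()
            cpu0 = time.process_time()
            func()
            wall = time.perf_counter() - wall0
            cpu = time.process_time() - cpu0
            if cpu >= cpu_ratio * wall:
                return wall
        raise RuntimeError("too much system noise, giving up")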

Regards

Antoine.



