[Python-Dev] Microbenchmarks

Victor Stinner victor.stinner at gmail.com
Thu Sep 15 15:33:28 EDT 2016


The discussion on benchmarking is no longer related to compact dict, so
I'm starting a new thread.


2016-09-15 13:27 GMT+02:00 Paul Moore <p.f.moore at gmail.com>:
> Just as a side point, perf provided essentially identical results but
> took 2 minutes as opposed to 8 seconds for timeit to do so. I
> understand why perf is better, and I appreciate all the work Victor
> did to create it, and analyze the results, but for getting a quick
> impression of how a microbenchmark performs, I don't see timeit as
> being *quite* as bad as Victor is claiming.

Heh, I expected such a complaint. I already wrote a section in the docs
explaining "why perf is so slow":
http://perf.readthedocs.io/en/latest/perf.html#why-is-perf-so-slow


So you are saying that timeit just works and is faster? OK, let's look at a small session:

$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 46.7 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 46.9 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 46.9 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 47 msec per loop

$ python2 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 36.3 msec per loop
$ python2 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 36.1 msec per loop
$ python2 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 36.5 msec per loop

$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 48.3 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 48.4 msec per loop
$ python3 -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
10 loops, best of 3: 48.8 msec per loop

I ran timeit 7 times on Python 3 and 3 times on Python 2. Please
ignore the Python 2 results; those runs are only there to interfere
with the Python 3 tests.

Now the question is: what is the "correct" result for Python 3? Let's
take the minimum of the minimums: 46.7 ms.

Now imagine that you only had the first 4 runs. What is the "good"
result now? The minimum is still 46.7 ms.

And what if you only had the last 3 runs? What is the "good" result
now? The minimum becomes 48.3 ms.

On such a microbenchmark, the difference between 46.7 ms and 48.3 ms is large :-(
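
For scale, here is a quick back-of-the-envelope check of the gap
between those two minimums (plain arithmetic, nothing perf-specific):

>>> fast, slow = 46.7, 48.3   # the two "best of 3" minimums above, in ms
>>> round((slow - fast) / fast * 100, 1)
3.4

The apparent result moves by more than 3% depending only on which runs
you happen to keep.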

How do you know that you ran timeit enough times to make sure that the
result is the right one?

For me, the timeit tool is broken because you *must* run it many times
to work around its limits.


In short, I wrote the perf module to answer these questions.

* perf uses multiple processes to test multiple memory layouts and
multiple randomized hash functions
* perf ignores the first run, which is used to "warm up" the benchmark
(--warmups command line option)
* perf provides many tools to analyze the distribution of results:
minimum, maximum, standard deviation, histogram, number of samples,
median, etc.
* perf displays the median +- standard deviation: the median is more
reproducible and the standard deviation gives an idea of the stability
of the benchmark
* etc.
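
For example, here is how the same microbenchmark could be run under
perf and its distribution analyzed afterwards (a rough sketch: the
exact option names may differ slightly between perf versions):

$ python3 -m perf timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)" -o bench.json
$ python3 -m perf stats bench.json   # min/max, mean, median, std dev, number of samples
$ python3 -m perf hist bench.json    # histogram of all samples

The JSON file keeps every sample from every process, so the analysis
can be redone later without rerunning the benchmark.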


> I will tend to use perf now that I have it installed, and now that I
> know how to run a published timeit invocation using perf. It's a
> really cool tool. But I certainly won't object to seeing people
> publish timeit results (any more than I'd object to *any*
> microbenchmark).

I consider timeit results not reliable at all. There is no standard
deviation, and it's hard to guess how many times the user ran timeit
or how he/she computed the "good" result.

perf takes ~60 seconds by default. If you don't care about accuracy,
use --fast and it only takes 20 seconds ;-)
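
Concretely, something like this (again a sketch; --fast should also be
accepted by the timeit subcommand):

$ python3 -m perf timeit --fast -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"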

Victor

