[Speed] A new perf module: toolkit to write benchmarks

Victor Stinner victor.stinner at gmail.com
Wed Jun 1 21:19:32 EDT 2016


Hi,

I started to write blog posts on stable benchmarks:

1) https://haypo.github.io/journey-to-stable-benchmark-system.html
2) https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
3) https://haypo.github.io/journey-to-stable-benchmark-average.html

One important point is that the minimum is commonly used in Python
benchmarks, whereas it is a bad practice when the goal is a stable benchmark.


I started to work on a toolkit to write benchmarks, the new "perf" module:
http://perf.readthedocs.io/en/latest/
https://github.com/haypo/perf


I used timeit as a concrete use case, since timeit is popular and
badly implemented. timeit currently uses 1 process running the
microbenchmark 3 times and takes the minimum. timeit is *known* to be
unstable, and the common advice is to run it at least 3 times and
again take the minimum of the minimums.
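
To make that pattern concrete, here is roughly what it looks like with
the stdlib timeit API (the statement and the loop counts are only
placeholders):

    import timeit

    # Default practice: a single process runs the statement 3 times
    # (repeat=3) and only the smallest total is kept.
    timings = timeit.repeat(stmt="sorted(range(100))",
                            repeat=3, number=100000)
    best = min(timings)
    print("best of %d: %.3f usec per loop"
          % (len(timings), best / 100000 * 1e6))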

Examples of links about timeit being unstable:

* https://mail.python.org/pipermail/python-dev/2012-August/121379.html
* https://bugs.python.org/issue23693
* https://bugs.python.org/issue6422 (not directly related)

Moreover, the timeit module disables the garbage collector, which is
also wrong: it's rare to disable the GC in applications, so the
benchmark runs under conditions that real code does not see.
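
Note: the timeit documentation suggests re-enabling the GC as the first
statement of the setup string if the GC matters for what you measure.
A minimal sketch (the statement is a placeholder):

    import timeit

    # timeit turns the garbage collector off while timing; re-enabling
    # it in the setup string measures the statement with the GC
    # running, as in a normal application.
    t = timeit.Timer("sorted(range(100))", setup="import gc; gc.enable()")
    print("%.3f sec for 100000 loops" % t.timeit(number=100000))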


My goal for the perf module is to provide basic features and then
reuse it in existing benchmarks:

* mean() and stdev() to display results (see the sketch after this list)
* clock chosen for benchmark
* result classes to store numbers
* etc.
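
To give an idea of what I mean by mean() and stdev(), here is a tiny
sketch using the statistics module; this is not the perf API, and the
samples are made up:

    import statistics

    # Made-up samples (seconds per loop) standing in for timings
    # collected from several runs.
    samples = [250e-9, 250e-9, 251e-9, 252e-9, 253e-9]

    print("Average: %.0f ns +- %.0f ns"
          % (statistics.mean(samples) * 1e9,
             statistics.stdev(samples) * 1e9))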

Work in progress:

* new implementation of timeit using multiple processes
* perf.metadata module: collect various information about Python, the
system, etc. (a rough sketch follows this list)
* file format to store numbers and metadata
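
As a rough idea of the kind of metadata I have in mind (not the actual
perf.metadata API, only plain stdlib calls):

    import platform
    import sys

    # Illustrative metadata a benchmark run could record.
    metadata = {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "python_compiler": platform.python_compiler(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }
    for name in sorted(metadata):
        print("%s: %s" % (name, metadata[name]))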

I'm interested in the very basic perf.py internal text format: one
timing per line, that's all. But it's incomplete: the "loops"
information is not stored. Maybe a binary format is better? I don't
know yet.
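
For illustration, a format keeping the loops information could look
something like this JSON sketch (not the actual perf format; the field
names are made up):

    import json
    import platform

    # A richer result file: loop count and some metadata are stored
    # alongside the raw samples.
    result = {
        "loops": 10 ** 6,
        "samples": [250e-9, 250e-9, 251e-9],
        "metadata": {"python_version": platform.python_version()},
    }
    with open("bench.json", "w") as fp:
        json.dump(result, fp)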

It should be possible to combine files from multiple processes. I'm
also interested in implementing a generic "rerun" command to add more
samples if a benchmark doesn't look stable enough.
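
Combining files could be as simple as concatenating the samples; a
sketch assuming the plain "one timing per line" format (the file names
are placeholders):

    import statistics

    # Merge files written by several worker processes and summarize
    # the combined samples.
    samples = []
    for filename in ("worker-1.txt", "worker-2.txt", "worker-3.txt"):
        with open(filename) as fp:
            samples.extend(float(line) for line in fp if line.strip())

    print("%d samples: %.0f ns +- %.0f ns"
          % (len(samples),
             statistics.mean(samples) * 1e9,
             statistics.stdev(samples) * 1e9))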


perf.timeit looks more stable than timeit, and the CLI is basically the
same: replace "-m timeit" with "-m perf.timeit".

5 timeit outputs ("1000000 loops, best of 3: ... per loop"):

* 0.247 usec
* 0.252 usec
* 0.247 usec
* 0.251 usec
* 0.251 usec

It's disturbing to get 3 different "minimums" :-/

5 perf.timeit outputs ("Average: 25 runs x 3 samples x 10^6 loops: ..."):

* 250 ns +- 3 ns
* 250 ns +- 3 ns
* 251 ns +- 3 ns
* 251 ns +- 4 ns
* 251 ns +- 3 ns

Note: I also got "258 ns +- 17 ns" when I opened a webpage in Firefox
while the benchmark was running.

Note: I ran these benchmarks on a regular Linux system without any
specific tuning. ASLR is enabled, but the system was idle.

Victor

