[Speed] Disable hash randomization to get reliable benchmarks

Victor Stinner victor.stinner at gmail.com
Mon Apr 25 03:52:39 EDT 2016


Hi,

The problem with not cheating on the statistical distribution is that
I would like to get quick feedback on my changes, to know whether a
change is faster or not. Having to wait 1 hour to check a single
change is not really convenient. I prefer to get feedback in less
than 5 minutes. Tuning Linux and disabling hash randomization allows
running fewer iterations and so takes less time. If you don't tune
anything, you need *a lot* of iterations to reduce the "noise" of the
benchmark.


2016-04-25 8:25 GMT+02:00 Maciej Fijalkowski <fijall at gmail.com>:
> The problem with disabled ASLR is that you change the measurement from
> a statistical distribution, to one draw from a statistical
> distribution repeatedly.

The problem is that perf.py only runs one process per benchmark and
per Python binary. Let's say that binary A runs with no hash
collisions, so all dict accesses succeed at the first probe, whereas
binary B runs with many hash collisions and so gets worse performance.
Is it fair to compare only these two specific runs? What can we
conclude from the result?

Note: The dict type of CPython uses open addressing, so even if two
keys get different hash values, a dict lookup may need more than one
probe to retrieve a dict entry.
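A rough sketch of that point (not CPython's actual dict code, which also
perturbs the probe sequence): the slot is derived from the hash by masking
with the table size, so keys with different hash values can still land in
the same slot of a small table and force extra probes. The key names below
are arbitrary.

# Simplified model: slot = hash(key) & (table_size - 1), as in a small
# 8-slot CPython dict. Which keys share a slot depends on the hash seed,
# which is exactly what makes timings vary between processes.
table_size = 8
for key in ("foo", "bar", "baz", "qux", "spam", "eggs"):
    print(key, hash(key) & (table_size - 1))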

Right now, using ASLR and a randomized hash function with perf.py is not
fair. I'm talking about running perf.py only once. I'm unable to
manually combine/compare multiple runs of perf.py, and getting multiple
different results is very confusing for me.


If you want to use ASLR and hash randomization, perf.py must be
modified to run multiple processes (sequentially) to get a better
statistical distribution. No?

I don't know how many processes we have to run. According to my
quick analysis there are 3 different cases, so the strict minimum would
be to run 3 processes for my specific case. For bm_call_simple, the
number of loops is 15 with --fast, 150 by default, and 300 with
--rigorous. Maybe we should only run one loop iteration per process?
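Here is a minimal sketch of what I have in mind, outside of perf.py (the
names N_PROCESSES, LOOPS_PER_PROCESS and the timed statement are made up,
not taken from bm_call_simple): spawn several fresh interpreter processes,
each with its own hash seed and address-space layout, take a few timings per
process, and pool the samples.

import statistics
import subprocess
import sys

N_PROCESSES = 10       # each child gets a fresh hash seed / memory layout
LOOPS_PER_PROCESS = 3  # keep the per-process work small

# Stand-in workload: time a trivial Python function call, roughly in the
# spirit of bm_call_simple, but much simpler.
CHILD_CODE = """
import timeit
for t in timeit.repeat('f()', 'def f(): pass', repeat=%d, number=100000):
    print(t)
""" % LOOPS_PER_PROCESS

samples = []
for _ in range(N_PROCESSES):
    out = subprocess.check_output([sys.executable, "-c", CHILD_CODE],
                                  universal_newlines=True)
    samples.extend(float(line) for line in out.split())

print("samples:", len(samples))
print("avg=%.6f stdev=%.6f" % (statistics.mean(samples),
                               statistics.stdev(samples)))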

FYI right now, I'm not using perf.py to prove that my patch makes
CPython faster, but rather to analyze why my change makes CPython slower
:-) It *looks* like Python function calls are between 2 and 7% slower,
but it also looks like bm_call_simple is an unstable microbenchmark
:-(


> There is no going around doing multiple runs
> and doing an average on that. Essentially for the same reason why
> using min is much worse than using average, with ASLR say you get:
> 2.0+-0.3 which we run 5 times and 1.8, 1.9, 2.2, 2.1, 2.1 now if you
> disable ASLR, you get one draw repeated 5 times, which might be 2.0,
> but also might be 1.8, 5 times. That just hides the problem, but does
> not actually fix it (because if you touch something, stuff might be
> allocated in a different order and then you get a different draw)
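(A toy simulation of that point, with made-up numbers: with ASLR each run is
a fresh draw from the timing distribution, while disabling ASLR effectively
repeats whichever single layout you happened to get.)

import random
import statistics

random.seed(1)
timings = [random.gauss(2.0, 0.3) for _ in range(1000)]  # pretend "true" distribution

with_aslr = [random.choice(timings) for _ in range(5)]   # 5 independent draws
without_aslr = [random.choice(timings)] * 5              # 1 draw repeated 5 times

print("with ASLR:   ", [round(t, 2) for t in with_aslr],
      "avg=%.2f" % statistics.mean(with_aslr))
print("without ASLR:", [round(t, 2) for t in without_aslr],
      "avg=%.2f" % statistics.mean(without_aslr))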

My practical problem is getting reliable benchmark results. If you don't
tune Linux and don't disable hash randomization, the results look "random".

Example of output without tuning. I ran "python3 perf.py
../default/python-revert ../default/python-commit -b call_simple -v
--fast" 3 times.

[ fast, 15 iterations ]

### call_simple ###
Min: 0.235318 -> 0.247203: 1.05x slower
Avg: 0.237601 -> 0.251384: 1.06x slower
Significant (t=-6.32)
Stddev: 0.00214 -> 0.00817: 3.8069x larger

### call_simple ###
Min: 0.234191 -> 0.247109: 1.06x slower
Avg: 0.234660 -> 0.247480: 1.05x slower
Significant (t=-36.14)
Stddev: 0.00102 -> 0.00093: 1.0967x smaller

### call_simple ###
Min: 0.235790 -> 0.247089: 1.05x slower
Avg: 0.238978 -> 0.247562: 1.04x slower
Significant (t=-9.38)
Stddev: 0.00342 -> 0.00094: 3.6504x smaller

You ask me to ignore the Min line. Ok. But the Avg lines say 1.04x,
1.05x and 1.06x slower. Which one is the "good" result? :-)

Usually, the spread is even larger, for example between 1.02x slower and
1.07x slower.


[ rigorous, 300 iterations ]

The --fast option of perf.py is just a toy, right? Serious devs must
use the super slow --rigorous mode! Ok, let's try it.

### call_simple ###
Min: 0.234102 -> 0.248098: 1.06x slower
Avg: 0.236218 -> 0.254318: 1.08x slower
Significant (t=-30.32)
Stddev: 0.00561 -> 0.00869: 1.5475x larger

### call_simple ###
Min: 0.234109 -> 0.248024: 1.06x slower
Avg: 0.240069 -> 0.255194: 1.06x slower
Significant (t=-15.21)
Stddev: 0.01126 -> 0.01304: 1.1584x larger

### call_simple ###
Min: 0.235272 -> 0.248225: 1.06x slower
Avg: 0.244106 -> 0.258349: 1.06x slower
Significant (t=-13.27)
Stddev: 0.00830 -> 0.01663: 2.0053x larger

Again, I ignore the Min line. Average: hum... is it 1.06x or 1.08x
slower? For me, it's not really the same :-/

In the reference Python, the average timing varies between 0.236 and
0.244 across the 3 runs. Since the difference between the reference and
the patched Python is tiny, the stability of the microbenchmark
matters.
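A quick back-of-the-envelope check with the numbers above: the run-to-run
spread of the reference Python alone is of the same order as the slowdown
being measured, which is why the result is hard to trust.

# Averages of the reference Python in the three --rigorous runs above.
ref_avgs = [0.236218, 0.240069, 0.244106]
spread = (max(ref_avgs) - min(ref_avgs)) / min(ref_avgs)
print("reference run-to-run spread: %.1f%%" % (spread * 100))  # about 3.3%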

FYI I'm running benchmarks on my desktop PC. It's more convenient to
run benchmarks locally than to transfer changes to a hypothetical
dedicated benchmark server (tuned to run (micro)benchmarks more
reliably). Since benchmarks are slow, I'm doing something else while
the benchmark is running.

Victor

