[Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered

Antoine Pitrou solipsis at pitrou.net
Thu Sep 15 12:13:54 EDT 2016


On Thu, 15 Sep 2016 08:02:10 -0700
Raymond Hettinger <raymond.hettinger at gmail.com> wrote:
> 
> Eric is correct on this one.  The consecutive hashes make a huge difference for Python 3.5.  While there is still a full table scan, the check for NULL entries becomes a predictable branch when all the keys are in consecutive positions.  There is an astonishingly well-written Stack Overflow post that explains this effect clearly: http://stackoverflow.com/questions/11227809
> 
> With normal randomized keys, the Python 3.6 loop is dramatically better than the Python 3.5 one:
[...]

You're jumping to conclusions. While there is a difference, there
is no evidence that the difference is due to better branch prediction.

Actually, let's do a quick back-of-the-envelope calculation and show
that it can't be attributed mostly to branch prediction:

> ~/py36 $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
> 100 loops, best of 3: 12.3 msec per loop
> ~/py35 $ ./python.exe -m timeit -s "d=dict.fromkeys(map(str,range(10**6)))" "list(d)"
> 10 loops, best of 3: 54.7 msec per loop

For 10**6 elements, this is a 42ns difference per dict element.

A 2.6 GHz Haswell doesn't stall for 42ns on a branch
mispredict.  According to the Internet, the branch mispredict penalty
for a Haswell CPU is 15 cycles, which is 5.7ns at 2.6 GHz.  Far from the
observed 42ns.

42ns, however, is consistent with another possible effect: a main-memory
access following a last-level cache miss.
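The arithmetic above can be checked in a few lines; the 2.6 GHz clock and
the 15-cycle penalty are the figures quoted in the text:

```python
# Back-of-the-envelope check of the numbers quoted above.
n = 10**6                      # dict elements in the benchmark
t35, t36 = 54.7e-3, 12.3e-3    # best-of-3 timings in seconds
per_element_ns = (t35 - t36) / n * 1e9
print(f"difference per element: {per_element_ns:.1f} ns")   # ~42.4 ns

clock_ghz = 2.6                # Haswell clock quoted above
penalty_cycles = 15            # reported branch mispredict penalty
penalty_ns = penalty_cycles / clock_ghz
print(f"mispredict penalty:     {penalty_ns:.1f} ns")       # well below 42 ns
```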


And indeed, Serhiy showed that this micro-benchmark is actually
dependent on insertion order *on Python 3.6*:

$ ./python -m timeit -s "l = [str(i) for i in range(10**6)];
d=dict.fromkeys(l)" "list(d)"
-> 100 loops, best of 3: 20 msec per loop

$ ./python -m timeit -s "import random; l = [str(i) for i in
range(10**6)]; random.shuffle(l); d=dict.fromkeys(l)" "list(d)"
-> 10 loops, best of 3: 55.8 msec per loop
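Serhiy's two one-liners can be folded into a single self-contained script.
This is only a sketch: N is reduced to 10**5 here to keep runtime short,
so the absolute timings will differ from the figures above (and the
ordered/shuffled gap only appears on a 3.6-style compact dict):

```python
import random
import timeit

N = 10**5  # smaller than the 10**6 used above, to keep runtime short

ordered = [str(i) for i in range(N)]
shuffled = ordered[:]
random.shuffle(shuffled)

# Same keys, different insertion order.
d_ordered = dict.fromkeys(ordered)
d_shuffled = dict.fromkeys(shuffled)

t_ordered = min(timeit.repeat("list(d)", globals={"d": d_ordered},
                              number=100, repeat=3))
t_shuffled = min(timeit.repeat("list(d)", globals={"d": d_shuffled},
                               number=100, repeat=3))
print(f"insertion order preserved: {t_ordered:.4f} s")
print(f"insertion order shuffled:  {t_shuffled:.4f} s")
```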

The only way the table scan (without NULL checks, since this is
Python 3.6) can depend on insertion order is that iterating over the
table must INCREF each element: this benchmark doesn't just scan the
table in a nice, prefetcher-friendly linear sequence, it also accesses
object headers at arbitrary places in Python's heap memory.
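The INCREF itself is easy to observe from pure Python (CPython-specific,
since sys.getrefcount reads the reference count stored in the object
header): list(d) must take a new reference to every key, so each key's
refcount rises by exactly one while the snapshot list is alive.

```python
import sys

d = dict.fromkeys(["alpha", "beta", "gamma"])
key = next(iter(d))

before = sys.getrefcount(key)   # getrefcount itself adds one temporary ref
snapshot = list(d)              # iterating the dict must INCREF every key
after = sys.getrefcount(key)

print(after - before)           # 1: the list holds one extra reference per key
del snapshot                    # dropping the list DECREFs the keys again
```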

Since this micro-benchmark creates the keys in order just before
filling the dict with them, randomizing the insertion order destroys
the temporal locality of object header accesses when iterating over the
dict keys. *This* looks like the right explanation, not branch
mispredicts due to NULL checks.

This also shows that a micro-benchmark that merely looks ok can actually
be a terrible proxy for actual performance.


As a further validation of this theory, let's dramatically decrease the
working set size on the initial benchmark:

$ ./python -m timeit -s "d=dict.fromkeys(map(str,range(10**3)))"
"list(d)"

-> Python 3.5: 100000 loops, best of 3: 10.9 usec per loop
-> Python 3.6: 100000 loops, best of 3: 9.72 usec per loop

When the working set fits in the cache, this micro-benchmark is
only 12% slower on 3.5 than on 3.6.
*This* much smaller difference (a mere 1.2ns per dict
element) could be attributed to eliminating the NULL checks, or to any
other streamlining of the core iteration logic.
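The working-set experiment can be sketched as one script that sweeps the
dict size: per-element iteration cost should climb once the dict and the
key objects it references stop fitting in cache. Absolute numbers are
entirely machine-dependent; this only illustrates the methodology:

```python
import timeit

results = {}
for exp in (3, 4, 5):
    n = 10**exp
    d = dict.fromkeys(map(str, range(n)))
    t = min(timeit.repeat("list(d)", globals={"d": d},
                          number=50, repeat=3))
    results[exp] = t / 50 / n * 1e9   # ns per dict element
    print(f"n=10**{exp}: {results[exp]:6.1f} ns per element")
```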

Regards

Antoine.
