[issue30671] dict: improve lookup function

Tim Peters report at bugs.python.org
Mon Jun 19 01:31:23 EDT 2017


Tim Peters added the comment:

Dmitry, I suggest you spend more of the time you give to thinking to writing code instead ;-)  But, really, that's the easiest & surest way to discover for yourself where your analysis is going off in the weeds.

For example, issue 28201 was both simpler and worse than your thoughts about it:  x and y collide initially if and only if x = y mod 2**k.  If they do collide, then _regardless_ of the value of N it's necessarily the case that N*x + x + 1 = N*y + y + 1 mod 2**k too.  That is, the second probes _always_ collided (not just half the time) in the case of a first-probe collision, and this would be so regardless of whether we were multiplying by 5, 4, 1, 0, or 31459265358.  That was an unfortunate "off by 1" kind of mistake in the way the code was written.  It had nothing to do with, e.g., zero divisors.

After the obvious fix was applied, there's very little that can be said in reality.  Because, after the fix, higher-order bits of the hash codes - which had nothing at all to do with the initial x = y mod 2**k collision - are added in to the second probe values.  There's no good reason I can see to calculate what happens if those never-before-considered bits _happen_ to be all zeroes.  They might be - but so what?  They might be any pair of values in the cross product of range(2**5) with itself.  There's nothing especially interesting about the (0, 0) pair.

That's why you're getting pushback:  your analysis hasn't made sense to me, and the things you're calculating still don't appear to have much of anything to do with how collisions are actually resolved.  To the contrary, so long as you're ignoring the higher-order hash code bits (about which we can infer _nothing_ from that the first probe collided), you're ignoring the _heart_ of the collision resolution strategy.

Some other things writing code would make immediately obvious:

- Hashes of strings have nothing to do with pointers or memory addresses.  They have solely to do with the characters the string contains, and all values in range(256) show up as the last byte of string hashes.

- While hashes of pointers do, the internal `_Py_HashPointer()` rotates addresses right by 4 bits so that the predictable trailing zero bits have no effect on the first probe.  For example,

>>> a = object()
>>> id(a)    # the address
1583819629360
>>> hex(_)
'0x170c301a330'
>>> hash(a)  # the address is rotated so `0x3` is the last nibble
98988726835
>>> hex(_)
'0x170c301a33'

Because of all that, I'm going to close this.  But if you have an idea that actually speeds things, please post a patch and reopen it!  While it's true that I don't expect to see an actual improvement, clocks usually beat arguments ;-)

----------
resolution:  -> rejected
stage:  -> resolved
status: open -> closed

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue30671>
_______________________________________


More information about the Python-bugs-list mailing list