[Python-Dev] RE: The Dictionary Gem is polished!

Tim Peters tim.one@home.com
Thu, 21 Dec 2000 03:52:54 -0500


[Christian Tismer]
> Are you saying I should check the thing in? Really?

Of course.  The first thing you talked about showed a major improvement in
some bad cases, did no harm in the others, and both results were more than
just plausible -- they made compelling sense and were backed by simulation.
So why not check it in?  It's a clear net win!

Stuff since then has been a spattering of maybe-good maybe-bad maybe-neutral
ideas that hasn't gotten anywhere conclusive.  What I want to avoid is
another "Unicode compression" scenario, where we avoid grabbing a clear win
for months just because it may not be the best possible of all conceivable
compression schemes -- and then mistakes get made in a last-second rush to
get *any* improvement.

Checking in a clear improvement today does not preclude checking in a better
one next week <wink>.

> ...
> Ok, we have to stick with the given polynomials to stay
> compatible,

Na, feel free to explore that too, if you like.  It really should get some
study!  The polys there now are utterly arbitrary:  of all polys that happen
to be irreducible and that have x as a primitive root in the induced
multiplicative group, these are simply the smallest when viewed as binary
integers.  That's because they were *found* by trying all odd binary ints
with odd parity (even ints and ints with even parity necessarily correspond
to reducible polys), starting with 2**N+3 and going up until finding the
first one that was both irreducible and had x as a primitive root.  There's
no theory at all that I know of to say that any such poly is any better for
this purpose than any other.  And they weren't tested for that either --
they're simply the first ones "that worked at all" in a brute search.
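That brute search is easy to sketch in Python (an illustration, not the original search code; it represents a degree-N poly over GF(2) as an int with bit N set, and uses the fact that x is a primitive root iff its multiplicative order mod the poly is 2**N - 1):

```python
def x_order(poly, n):
    """Multiplicative order of x in GF(2)[x] modulo poly (degree n, bit n set)."""
    r, k = 2, 1                 # r = x**k reduced mod poly, starting at x
    while r != 1:
        r <<= 1                 # multiply by x
        if r >> n & 1:
            r ^= poly           # degree reached n: reduce by the poly
        k += 1
        if k > (1 << n):        # safety net: never cycles back to 1
            return 0
    return k

def find_poly(n):
    """First odd int with odd parity >= 2**n + 3 that is irreducible
    with x as a primitive root (i.e. order of x is 2**n - 1)."""
    p = (1 << n) + 3            # starting point, as described above
    while True:
        if bin(p).count("1") % 2 == 1 and x_order(p, n) == (1 << n) - 1:
            return p
        p += 2                  # odd ints only
```

For small table sizes this reproduces constants of the familiar "2**N plus a small odd number" shape: find_poly(4) == 16 + 3, find_poly(5) == 32 + 5, find_poly(6) == 64 + 3.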

Besides, Python's "better than random" dict behavior -- when it obtains! -- is
almost entirely due to the fact that its hash functions produce distinct
starting indices more often than a random hash function would.  The
contribution of the GF-based probe sequence in case of collision is to avoid
the terrible behavior most other probe sequences would cause, given that
Python's hash functions also tend to fill solid contiguous slices of the
table more often than a random hash function would.
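To make "GF-based probe sequence" concrete: on collision the increment is stepped through a GF(2**N) shift register (multiply by x, reduce by the poly), so with a primitive poly it cycles through every nonzero value before repeating -- no short cycles.  A sketch of just that increment cycle (not the exact dictobject.c code):

```python
def incr_cycle(start, poly, n):
    """Successive probe increments: repeatedly multiply by x mod poly."""
    mask = (1 << n) - 1
    incr = start & mask or 1    # the increment must be nonzero
    seen = []
    while True:
        seen.append(incr)
        incr <<= 1              # multiply by x
        if incr > mask:
            incr ^= poly        # reduce back into n bits
        if incr == seen[0]:     # back to the start: one full cycle
            return seen
```

With poly 19 (x**4 + x + 1, primitive), any nonzero start yields all 15 nonzero increments -- exactly the property the "x as primitive root" condition buys.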

[stuff about rotation]
> ...
> Too bad that this isn't supported in C. It is a native
> machine instruction on X86 machines.

Guido long ago rejected hash functions based on rotation for this reason;
he's not likely to approve of rotations in the probe sequence either <wink>.

A similar frustration is that almost all modern CPUs have a fast instruction to
get at the high 32 bits of a 32x32->64 bit multiply:  another way to get the
high bits of the hash code into play is to multiply the 32-bit hash code by
a 32-bit constant (see Knuth for "Fibonacci hashing" details), and take the
least-significant N bits of the *upper* 32 bits of the 64-bit product as the
initial table index.  If the constant is chosen correctly, this defines a
permutation on the space of 32-bit unsigned ints, and can be very effective
at "scrambling" arithmetic progressions (which Python's hash functions often
produce).  But C doesn't give a decent way to get at that either.
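A sketch of that multiplicative scheme in Python, which has the big product for free (the constant 2654435769 == floor(2**32 / golden_ratio) is the usual Knuth-style multiplier -- Tim points at Knuth but the specific constant here is my assumption):

```python
FIB = 2654435769  # floor(2**32 / phi); assumed Knuth-style multiplier

def fib_index(h, n):
    """Initial table index: low n bits of the high word of a 32x32->64 multiply."""
    prod = (h & 0xFFFFFFFF) * FIB         # the full 64-bit product
    return (prod >> 32) & ((1 << n) - 1)  # least-significant n bits of upper half
```

The permutation property mentioned above comes from the multiplier being odd: multiplication by an odd constant is invertible mod 2**32, so no two 32-bit hash codes collide before the truncation to n bits.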

> ...
> On the current "faster than random" cases, I assume that
> high bits in the hash are less likely than low bits,

I'm not sure what this means.  As the comment in dictobject.c says, it's
common for Python's hash functions to return a result with lots of leading
zeroes.  But the lookup currently applies ~ to those first (which is a bad
idea -- see earlier msgs), so the actual hash that gets *used* often has
lots of leading ones.
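For instance (a quick sketch of the effect; the 32-bit view is just for illustration):

```python
h = 42                      # a typical small-int hash: many leading zeros
used = ~h & 0xFFFFFFFF      # what the lookup actually works with after ~
assert format(h, "032b").startswith("0" * 20)     # leading zeros in
assert format(used, "032b").startswith("1" * 20)  # leading ones out
```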

> so it is more likely that an entry finds its good place in the dict,
> before bits are rotated in.  Hence the "good" cases would be kept.

I can agree with this easily if I read the above as asserting that in the
very good cases today, the low bits of hashes (whether or not ~ is applied)
vary more than the high bits.

> ...
> Random integers seem to withstand any of these procedures.

If you wanted to, you could *define* random this way <wink>.

> ...
> I'd say let's do the patch   --   ciao - chris

full-circle-ly y'rs  - tim