[Python-Dev] RE: The Dictionary Gem is polished!

Tim Peters tim.one@home.com
Mon, 18 Dec 2000 19:09:35 -0500


Sounds good to me!  It's a very cheap way to get the high bits into play.

>        i = (~_hash) & mask

The ~ here seems like pure superstition to me (and the comments in the C
code don't justify it at all -- I added a nag of my own about that the last
time I checked in dictobject.c -- and see below for a bad consequence of
doing ~).

>            # note that we do not mask!
>            # even the shifting may not be worth it.
>            incr = _hash ^ (_hash >> 3)

The shifting was just another cheap trick to get *some* influence from the
high bits.  It's very limited, though.  Toss it (it appears to be from the
"random operations yield random results" <wink/sigh> matchbook school of
design).
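How limited is easy to see concretely: the shift only lets each hash bit leak three positions downward, so whole low-order ranges of the increment can be completely untouched by the high bits. A toy Python sketch (my own example values, not the dict implementation):

```python
# Toy demonstration that incr = hash ^ (hash >> 3) mixes high bits
# only weakly: the shift moves each bit down just three positions,
# so whenever the low six bits of a hash are zero, the low three
# bits of the increment are forced to zero too, no matter what the
# high bits look like.
def incr(h):
    return h ^ (h >> 3)

# Every multiple of 64 has zero low six bits, so its increment is
# always a multiple of 8 -- the high bits never reach the bottom.
assert all(incr(h) % 8 == 0 for h in range(0, 1 << 16, 64))
```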

[MAL]
> BTW, would changing the hash function on strings from the simple
> XOR scheme to something a little smarter help improve the performance
> too (e.g. most strings used in programming never use the 8-th
> bit) ?

Don't understand -- the string hash uses multiplication:

    x = (1000003*x) ^ *p++;

in a loop.  Replacing "^" there by "+" should yield slightly better results.
As is, string hashes are a lot like integer hashes, in that "consecutive"
strings

   J001
   J002
   J003
   J004
   ...

yield hashes very close together in value.  But, because the current dict
algorithm uses ~ on the full hash but does not use ~ on the initial
increment, (~hash)+incr too often yields the same result for distinct hashes
(i.e., there's a systematic (but weak) form of clustering).
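Both effects are easy to reproduce. Below is a toy version of the multiplicative loop quoted above, truncated to 32 bits (the name `toy_strhash` and the starting value are mine; CPython's real function also seeds with the first character and XORs in the string length, omitted here). It shows the "consecutive strings" clustering, and then the `(~hash)+incr` collapse for small hashes:

```python
# Toy version of the multiplicative string-hash loop quoted above,
# kept to 32 bits; CPython's real function also seeds with the first
# character and XORs in the length, which this sketch omits.
def toy_strhash(s):
    x = 0
    for c in s:
        x = ((1000003 * x) ^ ord(c)) & 0xFFFFFFFF
    return x

hashes = [toy_strhash(s) for s in ("J001", "J002", "J003", "J004")]
# Only the last character differs between these strings, and XOR
# folds it straight into the low bits: every pair of hashes differs
# in the low three bits alone.
for a in hashes:
    for b in hashes:
        assert (a ^ b) < 8

# The systematic clustering described above: for any hash below 8,
# the shift contributes nothing (h >> 3 == 0), so incr == h, and then
# (~h) + incr == -h - 1 + h == -1 for *every* such hash.
assert all((~h) + (h ^ (h >> 3)) == -1 for h in range(8))
```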

Note that Python is doing something very unusual:  hashes are *usually*
designed to yield an approximation to randomness across all bits.  But
Python's hashes never achieve that.  This drives theoreticians mad (like the
fellow who originally came up with the GF idea), but tends to work "better
than random" in practice (e.g., a truly random hash function would almost
certainly produce many collisions when fed a fat range of consecutive
integers but still less than half the table size; but Python's trivial
"identity" integer hash produces no collisions in that common case).
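The integer case can be checked directly: in CPython, hash(i) == i for small non-negative ints, so a run of consecutive integers shorter than the table fills distinct slots with zero collisions, while a truly random hash would collide more often than not. A quick sketch (toy table size, mine):

```python
# With CPython's identity hash for small non-negative ints
# (hash(i) == i), consecutive integers land in consecutive slots:
# masking with (table_size - 1) gives every key its own slot.
table_size = 16          # dict tables are powers of two
mask = table_size - 1
keys = range(7)          # consecutive ints, under half the table size

slots = [hash(k) & mask for k in keys]
assert len(set(slots)) == len(keys)   # zero collisions

# By contrast, a truly random hash into 16 slots would collide more
# often than not with 7 keys (birthday bound):
p_no_collision = 1.0
for n in range(len(keys)):
    p_no_collision *= (table_size - n) / table_size
assert p_no_collision < 0.5   # random hashing usually collides here
```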

[Christian]
> - The bits used from the string hash are not well distributed
> - using a "warmup wheel" on the hash to suck all bits in
>   gives the same quality of hashes like random numbers.

See above and be very cautious:  none of Python's hash functions produce
well-distributed bits, and-- in effect --that's why Python dicts often
perform "better than random" on common data.  Even what you've done so far
appears to provide marginally worse statistics for Guido's favorite kind of
test case ("worse" in two senses:  total number of collisions (a measure of
amortized lookup cost), and maximum collision chain length (a measure of
worst-case lookup cost)):

   d = {}
   for i in range(N):
       d[repr(i)] = i

check-in-one-thing-then-let-it-simmer-ly y'rs  - tim