[Python-Dev] Re: The Dictionary Gem is polished!

Christian Tismer tismer@tismer.com
Tue, 19 Dec 2000 02:51:32 +0100


Tim Peters wrote:
> 
> Sounds good to me!  It's a very cheap way to get the high bits into play.

That's what I wanted to hear. It's also the reason why I try
to stay conservative: do one obviously useful bit, but
do not break any of the inherent benefits, like those
"better than random" amenities.
Python's dictionary algorithm appears to be "near perfect"
and in the "never touch it, except very carefully, or redo it
completely" category. I tried the tightrope walk of just adding
a tiny topping.

> >        i = (~_hash) & mask

Yes, that stuff cost me two hours last night :-)
I just decided not to touch it. Arbitrary crap!
Although an XOR with hash >> (number of mask bits)
would perform much better (in many cases, but not all).
Anyway, simple shifting cannot solve general bit
distribution problems. Nor can I :-)
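
Here is a tiny toy of mine to show what I mean (the 8-slot table
and the names are made up, this is not dictobject.c):

    # Two hashes with equal low bits always land on the same first slot
    # under the ~ formula (it only complements the low bits), while the
    # XOR fold lets the next few higher bits break the tie.
    mask = 7                      # 3 mask bits, i.e. an 8-slot table
    nbits = 3

    for h in (0x10, 0x20, 0x30):  # same low 3 bits, different higher bits
        i_tilde = (~h) & mask
        i_fold = (h ^ (h >> nbits)) & mask
        print((hex(h), i_tilde, i_fold))
    # the ~ column is 7 for all three; the folded column is 2, 4, 6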

> The ~ here seems like pure superstition to me (and the comments in the C
> code don't justify it at all -- I added a nag of my own about that the last
> time I checked in dictobject.c -- and see below for a bad consequence of
> doing ~).
> 
> >            # note that we do not mask!
> >            # even the shifting may not be worth it.
> >            incr = _hash ^ (_hash >> 3)
> 
> The shifting was just another cheap trick to get *some* influence from the
> high bits.  It's very limited, though.  Toss it (it appears to be from the
> "random operations yield random results" <wink/sigh> matchbook school of
> design).

Now, comment it out, and you will see my new algorithm perform much
worse. I kept it since it had an advantage in "my" case (bad guy, I
know), and I wanted an argument for getting my change accepted.
"No cost, just profit, nearly the same" was what I tried to sell.

> [MAL]
> > BTW, would changing the hash function on strings from the simple
> > XOR scheme to something a little smarter help improve the performance
> > too (e.g. most strings used in programming never use the 8-th
> > bit) ?
> 
> Don't understand -- the string hash uses multiplication:
> 
>     x = (1000003*x) ^ *p++;
> 
> in a loop.  Replacing "^" there by "+" should yield slightly better results.

For short strings, this prime has a bad influence on the low bits,
making it perform suboptimally for small dicts.
See the new2 algorithm, which funnily enough corrects for that.
The reason is obvious: just look at the bit pattern
of 1000003:  '0xf4243'

Without giving proof, this smells like bad bit distribution on small
strings to me. You smell it too, right?
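
Here is the inner loop Tim quoted, transcribed to Python just to poke
at it (only the loop; the real string_hash also seeds from the first
character and mixes in the length, which I leave out), together with
a look at the multiplier's bits:

    def toy_string_hash(s):
        # only the inner loop; wrap like a 32-bit C long so numbers stay small
        x = 0
        for c in s:
            x = ((1000003 * x) ^ ord(c)) & 0xffffffff
        return x

    print(hex(1000003))   # 0xf4243
    print(bin(1000003))   # 0b11110100001001000011
    for s in ("ab", "ac", "ad", "ae"):
        # low three bits, which is all a tiny table looks at on the first probe
        print((s, toy_string_hash(s) & 7))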

> As is, string hashes are a lot like integer hashes, in that "consecutive"
> strings
> 
>    J001
>    J002
>    J003
>    J004
>    ...
> 
> yield hashes very close together in value. 

A bad generator in that case. I'll look for a better one.
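
With the same toy hash as in the sketch above (repeated so this
snippet runs on its own), Tim's keys come out like this:

    def toy_string_hash(s):
        x = 0
        for c in s:
            x = ((1000003 * x) ^ ord(c)) & 0xffffffff
        return x

    for key in ("J001", "J002", "J003", "J004"):
        print((key, hex(toy_string_hash(key))))
    # the printed hashes differ only in their final hex digit, because only
    # the last character changes and it is XORed in at the very last step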

> But, because the current dict
> algorithm uses ~ on the full hash but does not use ~ on the initial
> increment, (~hash)+incr too often yields the same result for distinct hashes
> (i.e., there's a systematic (but weak) form of clustering).

You name it.

> Note that Python is doing something very unusual:  hashes are *usually*
> designed to yield an approximation to randomness across all bits.  But
> Python's hashes never achieve that.  This drives theoreticians mad (like the
> fellow who originally came up with the GF idea), but tends to work "better
> than random" in practice (e.g., a truly random hash function would almost
> certainly produce many collisions when fed a fat range of consecutive
> integers but still less than half the table size; but Python's trivial
> "identity" integer hash produces no collisions in that common case).

A good reason to be careful with changes (ahem).
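
Tim's integer case is easy to see with a toy count of first-probe
collisions (table size and key range are just my picks, and this
counts first probes only, not the real conflict resolution):

    import random

    size = 1024
    mask = size - 1
    keys = range(400)   # a fat range of consecutive ints, < half the table

    def first_probe_collisions(slot_of):
        seen, collisions = set(), 0
        for k in keys:
            slot = slot_of(k)
            if slot in seen:
                collisions += 1
            seen.add(slot)
        return collisions

    print(first_probe_collisions(lambda k: k & mask))
    # identity hash: 0
    print(first_probe_collisions(lambda k: random.randrange(size)))
    # a "random" hash: typically several dozen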

> [Christian]
> > - The bits used from the string hash are not well distributed
> > - using a "warmup wheel" on the hash to suck all bits in
> >   gives the same quality of hashes like random numbers.
> 
> See above and be very cautious:  none of Python's hash functions produce
> well-distributed bits, and-- in effect --that's why Python dicts often
> perform "better than random" on common data.  Even what you've done so far
> appears to provide marginally worse statistics for Guido's favorite kind of
> test case ("worse" in two senses:  total number of collisions (a measure of
> amortized lookup cost), and maximum collision chain length (a measure of
> worst-case lookup cost)):
> 
>    d = {}
>    for i in range(N):
>        d[repr(i)] = i

Nah, I did quite a lot of tests, and the trip number shows a
variation of about 10% without making either the old or the new
version look better. That is just the randomness inside.
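
For completeness, a toy harness for the two numbers Tim mentions
(total collisions and longest chain). The probing here is *not* the
real dict scheme, just plain double hashing into a power-of-two table
with the increment forced odd so every slot stays reachable, and
toy_string_hash is the same toy as above:

    def toy_string_hash(s):
        x = 0
        for c in s:
            x = ((1000003 * x) ^ ord(c)) & 0xffffffff
        return x

    def fill_and_measure(keys, size=2048):
        mask = size - 1
        table = [None] * size
        collisions = longest = 0
        for key in keys:
            h = toy_string_hash(key)
            i = h & mask
            incr = (h ^ (h >> 3)) | 1   # forced odd: my tweak, not the prototype
            chain = 0
            while table[i] is not None:
                chain += 1
                i = (i + incr) & mask
            table[i] = key
            collisions += chain
            longest = max(longest, chain)
        return collisions, longest

    print(fill_and_measure([repr(i) for i in range(1000)]))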

> check-in-one-thing-then-let-it-simmer-ly y'rs  - tim

This is why I am thinking of being even more conservative:
try a division wheel, but with the inverses
of the original primitive roots, just in order to
match Guido's results :-)

making-perfect-hashes-of-interneds-still-looks-promising - ly y'rs

   - chris

-- 
Christian Tismer             :^)   <mailto:tismer@tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com