crc32 to be used as hash

Tim Peters tim.peters at gmail.com
Thu Nov 11 19:51:33 EST 2004


[Weiguang Shi]
> The literature I was refering to includes:
> 
> . R. Jain, A Comparison of Hashing Schemes for Address Lookup in
>  Computer Networks, IEEE Trans. on Communications, 1992, v40, n3.
> 
> . Z. Cao, Z. Wang, E. Zegura, Performance of Hashing-based Schemes for
>  Internet Load Balancing, IEEE INFOCOM 2000, p332-341.

So they're doing hashing of "small" address-related inputs, squashing
them down a lot, and using a variety of different 32- and 16-bit CRC
algorithms.  In one case, they found that simply xor'ing fields
together did very well -- which implies that they weren't facing a
very demanding problem.

There's no reason to suppose, a priori, that's got much to do with the
32-bit hash distribution of strings-in-email PythonLabs tested.

Since you haven't said what it is you're trying to hash, it's anyone's
guess how relevant what anyone has done to what you want to do.  So
the thing for you to do is to write a little app and *try* different
ways.  Details can matter, a lot.

For example, we didn't run any tests on subfields of 32-bit hashes,
and some bits may be "more random" than others.  One of the papers you
cited found that it wasn't so in what they were testing.  But here's
an example showing that's not a reliable conclusion in general:

"""
def drive():
    from binascii import crc32
    mask = 0x1fffff
    d = {}
    count = 0
    bytes = map(chr, range(256))
    for i in bytes[-32:]:
        for j in bytes:
            prefix = i+j
            count += len(bytes)
            for k in bytes:
                d[crc32(prefix + k) & mask] = 1
    print mask+1, "bins", count, "balls", len(d), "occupied"

drive()
"""

This is the output:

2097152 bins 2097152 balls 1048576 occupied

Amazingly enough, crc32 on this data, and taking the last 21 bits,
managed to miss half the bins entirely.  For a truly random hash
yielding 21 bits, the expected # of occupied bins is about 1325653
when tossing 2**21 balls into 2**21 buckets, with a standard deviation
of about 452.  So this is more than 600 sdevs "worse than random".

If we replace crc32 with hash, the output changes to

2097152 bins 2097152 balls 1490432 occupied

which is a few hundred sdevs "better than random".  The reason
Python's hash can be "better than random" for bin problems is
explained in the comments in dictobject.c -- some of the internal hash
functions are specifically trying to give better-than-random collision
statistics when fed dense inputs, because hashing is used in Python
mostly to drive dict lookups.

This isn't 100% reliable either, though, and you can fiddle the above
to find subfields and inputs where crc32 does better than the builtin
hash.  For example, in the above, if the hash is shifted right by 11
before taking the last 21 bits, the builtin hash does horribly,
filling only 8K bins, but crc32 doesn't do any worse then than it did
before.



More information about the Python-list mailing list