crc32 to be used as hash

Tim Peters tim.peters at gmail.com
Thu Nov 11 14:50:36 EST 2004


[Weiguang Shi]
> I'm thinking of using binascii.crc32 as a hash-function when I read in
> the reference
> http://www.python.org/doc/current/lib/module-binascii.html:
>
>   crc32(       data[, crc])
> 
>      Compute CRC-32, the 32-bit checksum of data, starting with an
>      initial crc. This is consistent with the ZIP file checksum.
>      Since the algorithm is designed for use as a checksum algorithm,
>      it is not suitable for use as a general hash algorithm.
>
> CRC32 has been shown in the (Internetworking) literature that it can
> be used as a good hash function. I'm wondering what's the concern
> here.

Sorry, I've no idea what "(Internetworking) literature" means.

PythonLabs ran large-scale statistical tests of string hashing
algorithms in 2000, using millions of strings taken from a large email
corpus.  binascii.crc32 delivered collision rates hundreds of standard
deviations worse than a "truly random" 32-bit hash would have
delivered.  That's the concern.

In the same tests, Python's builtin hash() was statistically
indistinguishable from a random 32-bit hash function wrt # of
collisions.

BTW, it's possible to design CRC polynomials with better hash
statistics, but that's not what binascii.crc32 was designed for.  Like
most CRC polynomials in wide use, it was designed to catch common
forms of data corruption (single-bit errors, bursts of 0 or 1 bits at
either end, stuff like that).



More information about the Python-list mailing list