BUG? sha-moduel returns same crc for different files

Tim Peters tim_one at email.msn.com
Mon Sep 18 12:24:13 EDT 2000


[Treutwein Guido]
>> While the probabilty of a file, to have a certain hash code is
>> 2^(-hash_bitlength), the probability of finding two files with
>> the same hash value is MUCH bigger; this is the so-called birthday
>> paradox. (due to the fact, that having 23 persons in a room, the
>> probabilty of having to with the same birthday is better than 50%;
>> for a 32bit-CRC the corresponding limit is about 77000 files for
>> a 50% chance).

[Erno Kuusela]
> 2^32 is much smaller than 2^160 (2^128 times smaller infact). how many
> files would be needed for there to be a 50% change of a sha-1 hash
> collision? (how is it calculated?)  2^160 is
> 1461501637330902918203684832716283019655932542976...

To a first approximation, which is more than adequate for this level of
analysis, if there are N birthdays then you can expect at least one
duplicate after roughly sqrt(N) random samples are drawn.  To a second
approximation, sqrt(pi*N/2) is closer, but it makes no practical difference
with numbers this large.  So about 2**80 for SHA (160-bit digest), 2**64 for
MD5 (128-bit digest), 2**16 for CRC32.  That's under assumptions of "true
randomness", of course, but they're not truly random, so it may be prudent
to be paranoid.  For example, it's very easy to construct any number of
distinct files with a given CRC32 hash; it's just *believed* to be
intractably difficult to do the same with MD5 or SHA.

...

>> If you don't have the time to write a hash function as
>> C extension package consider to use a combination of
>> sha-1, md-5, file size and crc32

> since there are 4294967296 times more possible values for sha-1
> than for md5, methinks this would not make much difference.

Then it depends on how valuable a small difference is to your application.
If you're betting someone's life on it, it's a good idea to combine a
variety of methods with different underpinnings (to guard against
currently-unknown systematic weakness in any one of them).  If you're just
trying to save a few of bytes of disk storage, a plain CRC32 is much cheaper
and probably adequate.





More information about the Python-list mailing list