binary file compare...

Thu Apr 16 10:59:19 EDT 2009

On 2009-04-16, Adam Olsen <rhamph at gmail.com> wrote:

>>>>> The chance of *accidentally* producing a collision, although
>>>>> technically possible, is so extraordinarily rare that it's
>>>>> completely overshadowed by the risk of a hardware or software
>>>>> failure producing an incorrect result.
>>>>
>>>> Not when you're using them to compare lots of files.
>>
>>>> Trust me. Been there, done that, got the t-shirt.

I must admit, that is a bit hard to believe.

>> Okay, before I tell you about the empirical, real-world
>> evidence I have could you please accept that hashes collide
>> and that no matter how many samples you use the probability of
>> finding two files that do collide is small but not zero.
>
> I'm afraid you will need to back up your claims with real files.
> Although MD5 is a smaller, older hash (128 bits, so you only need
> 2**64 files to find collisions),

You don't need quite that many to have a significant chance of
a collision.  With "only" something on the order of 2**61
files, you still have about a 1% chance of a collision.

Just for fun here are a few data points showing the probability
of at least one collision in a sample of size n using a perfect
128-bit hash.  These were calculated using the taylor-series
approximation shown at

 http://en.wikipedia.org/wiki/Birthday_paradox#Calculating_the_probability

For "a few million files" (we'll say 4e6), the probability of a
collision is so close to 0 that it can't be calculated using
double-precision IEEE floats.

For a few _billion_ files, the taylor series approximation is
still 0 to about 15 significant digits.

For a few _trillion_ files, you finally get a "non-zero"
probability of about 2e-14.

Here are a few more points:

n =     1806791225998070; p(n) = 0.00000000
n =     1987470348597877; p(n) = 0.00000001
n =     2186217383457665; p(n) = 0.00000001
n =     2404839121803431; p(n) = 0.00000001
n =     2645323033983774; p(n) = 0.00000001
n =     2909855337382151; p(n) = 0.00000001
n =     3200840871120366; p(n) = 0.00000002
n =     3520924958232403; p(n) = 0.00000002
n =     3873017454055643; p(n) = 0.00000002
n =     4260319199461207; p(n) = 0.00000003
n =     4686351119407328; p(n) = 0.00000003
n =     5154986231348061; p(n) = 0.00000004
n =     5670484854482868; p(n) = 0.00000005
n =     6237533339931155; p(n) = 0.00000006
n =     6861286673924271; p(n) = 0.00000007
n =     7547415341316699; p(n) = 0.00000008
n =     8302156875448370; p(n) = 0.00000010
n =     9132372562993208; p(n) = 0.00000012
n =    10045609819292530; p(n) = 0.00000015
n =    11050170801221784; p(n) = 0.00000018
n =    12155187881343964; p(n) = 0.00000022
n =    13370706669478362; p(n) = 0.00000026
n =    14707777336426200; p(n) = 0.00000032
n =    16178555070068822; p(n) = 0.00000038
n =    17796410577075706; p(n) = 0.00000047
n =    19576051634783280; p(n) = 0.00000056
n =    21533656798261608; p(n) = 0.00000068
n =    23687022478087772; p(n) = 0.00000082
n =    26055724725896552; p(n) = 0.00000100
n =    28661297198486208; p(n) = 0.00000121
n =    31527426918334832; p(n) = 0.00000146
n =    34680169610168320; p(n) = 0.00000177
n =    38148186571185152; p(n) = 0.00000214
n =    41963005228303672; p(n) = 0.00000259
n =    46159305751134040; p(n) = 0.00000313
n =    50775236326247448; p(n) = 0.00000379
n =    55852759958872200; p(n) = 0.00000458
n =    61438035954759424; p(n) = 0.00000555
n =    67581839550235368; p(n) = 0.00000671
n =    74340023505258912; p(n) = 0.00000812
n =    81774025855784816; p(n) = 0.00000983
n =    89951428441363312; p(n) = 0.00001189
n =    98946571285499648; p(n) = 0.00001439
n =   108841228414049616; p(n) = 0.00001741
n =   119725351255454592; p(n) = 0.00002106
n =   131697886381000064; p(n) = 0.00002548
n =   144867675019100096; p(n) = 0.00003084
n =   159354442521010112; p(n) = 0.00003731
n =   175289886773111136; p(n) = 0.00004515
n =   192818875450422272; p(n) = 0.00005463
n =   212100762995464512; p(n) = 0.00006610
n =   233310839295010976; p(n) = 0.00007998
n =   256641923224512096; p(n) = 0.00009678
n =   282306115546963328; p(n) = 0.00011710
n =   310536727101659712; p(n) = 0.00014169
n =   341590399811825728; p(n) = 0.00017144
n =   375749439793008320; p(n) = 0.00020744
n =   413324383772309184; p(n) = 0.00025099
n =   454656822149540160; p(n) = 0.00030369
n =   500122504364494208; p(n) = 0.00036745
n =   550134754800943680; p(n) = 0.00044460
n =   605148230281038080; p(n) = 0.00053794
n =   665663053309141888; p(n) = 0.00065088
n =   732229358640056192; p(n) = 0.00078751
n =   805452294504061824; p(n) = 0.00095280
n =   885997523954468096; p(n) = 0.00115278
n =   974597276349915008; p(n) = 0.00139469
n =  1072057003984906624; p(n) = 0.00168733
n =  1179262704383397376; p(n) = 0.00204131
n =  1297188974821737216; p(n) = 0.00246945
n =  1426907872303911168; p(n) = 0.00298726
n =  1569598659534302464; p(n) = 0.00361345
n =  1726558525487732736; p(n) = 0.00437061
n =  1899214378036506112; p(n) = 0.00528601
n =  2089135815840156928; p(n) = 0.00639252
n =  2298049397424172800; p(n) = 0.00772975
n =  2527854337166590464; p(n) = 0.00934539
n =  2780639770883249664; p(n) = 0.01129680

Here's the Python function I'm using:

def bp(n, d):
    return 1.0 - exp(-n*(n-1.)/(2.*d))

I haven't spent much time studying the numerical issues of the
way that the exponent is calculated, so I'm not entirely
confident in the results for "small" n values such that
p(n) == 0.0.

-- 
Grant Edwards                   grante             Yow! ... the MYSTERIANS are
                                  at               in here with my CORDUROY
                               visi.com            SOAP DISH!!