binary file compare...

Nigel Rantor wiggly at wiggly.org
Thu Apr 16 05:16:02 EDT 2009


Adam Olsen wrote:
> On Apr 15, 12:56 pm, Nigel Rantor <wig... at wiggly.org> wrote:
>> Adam Olsen wrote:
>>> The chance of *accidentally* producing a collision, although
>>> technically possible, is so extraordinarily rare that it's completely
>>> overshadowed by the risk of a hardware or software failure producing
>>> an incorrect result.
>> Not when you're using them to compare lots of files.
>>
>> Trust me. Been there, done that, got the t-shirt.
>>
>> Using hash functions to tell whether or not files are identical is an
>> error waiting to happen.
>>
>> But please, do so if it makes you feel happy, you'll just eventually get
>> an incorrect result and not know it.
> 
> Please tell us what hash you used and provide the two files that
> collided.

MD5

> If your hash is 256 bits, then you need around 2**128 files to produce
> a collision.  This is known as a Birthday Attack.  I seriously doubt
> you had that many files, which suggests something else went wrong.

Okay, before I tell you about the empirical, real-world evidence I have,
could you please accept that hashes collide and that, no matter how many
samples you use, the probability of finding two files that collide is
small but not zero?

Which is the only thing I've been saying.

Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice.
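
To put rough numbers on "small but not zero", here is a minimal sketch of
the standard birthday approximation. It assumes the hash behaves like an
ideal, uniformly random function, which is an assumption of this sketch and
not something established in the thread. The name collision_probability is
illustrative only. MD5 digests are 128 bits long.

```python
import math


def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_files distinct inputs
    share a digest, ASSUMING an ideal, uniformly distributed hash."""
    # Number of distinct pairs of files that could collide.
    pairs = num_files * (num_files - 1) / 2
    # Birthday approximation: p ~= 1 - exp(-pairs / 2**hash_bits).
    # -expm1(-x) evaluates 1 - exp(-x) without rounding tiny values to 0.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    for count in (10**6, 10**9, 2**64):
        print(count, collision_probability(count, 128))
```

Under that assumption the result for any realistic number of files is tiny
but strictly greater than zero, and it only reaches roughly 0.39 at 2**64
files for a 128-bit digest. The model covers accidental collisions only.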

If you are of the opinion, though, that a hash function can be used to
tell you whether or not two files are identical, then you are wrong. It
really is that simple.
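
One way to act on this, sketched under assumptions: treat a matching digest
as a hint that two files might be identical, and confirm with a
byte-for-byte comparison before trusting it. The helper names md5_of and
confirmed_duplicates are hypothetical and not from this thread;
hashlib.md5 and filecmp.cmp are standard-library calls.

```python
import filecmp
import hashlib
import itertools
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def confirmed_duplicates(paths):
    """Return pairs of paths whose contents are byte-for-byte identical."""
    # Differing digests prove two files differ, so the hash is a safe
    # way to rule pairs out cheaply.
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[md5_of(path)].append(path)

    # Equal digests only make files candidates; compare the actual bytes.
    pairs = []
    for candidates in by_digest.values():
        for first, second in itertools.combinations(candidates, 2):
            if filecmp.cmp(first, second, shallow=False):
                pairs.append((first, second))
    return pairs
```

The hash still does useful work here as a filter. It just never gets the
final say on whether two files are the same.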

I'm not sitting here discussing this for my health. I'm just trying to
give the OP the benefit of my experience. I have worked with other
people who insisted on this route and had to find out the hard way that
it was a Bad Idea (tm). They just wouldn't be told.

Regards,

   Nige


