binary file compare...

SpreadTooThin bjobrien62 at gmail.com
Thu Apr 16 13:15:14 EDT 2009


On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
> Adam Olsen wrote:
> > On Apr 15, 12:56 pm, Nigel Rantor <wig... at wiggly.org> wrote:
> >> Adam Olsen wrote:
> >>> The chance of *accidentally* producing a collision, although
> >>> technically possible, is so extraordinarily rare that it's completely
> >>> overshadowed by the risk of a hardware or software failure producing
> >>> an incorrect result.
> >> Not when you're using them to compare lots of files.
>
> >> Trust me. Been there, done that, got the t-shirt.
>
> >> Using hash functions to tell whether or not files are identical is an
> >> error waiting to happen.
>
> >> But please, do so if it makes you feel happy, you'll just eventually get
> >> an incorrect result and not know it.
>
> > Please tell us what hash you used and provide the two files that
> > collided.
>
> MD5
>
> > If your hash is 256 bits, then you need around 2**128 files to produce
> > a collision.  This is known as a Birthday Attack.  I seriously doubt
> > you had that many files, which suggests something else went wrong.
>
> Okay, before I tell you about the empirical, real-world evidence I have
> could you please accept that hashes collide and that no matter how many
> samples you use the probability of finding two files that do collide is
> small but not zero.
>
> Which is the only thing I've been saying.
>
> Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice.
>
> If you are of the opinion though that a hash function can be used to
> tell you whether or not two files are identical then you are wrong. It
> really is that simple.
>
> I'm not sitting here discussing this for my health, I'm just trying to
> give the OP the benefit of my experience, I have worked with other
> people who insisted on this route and had to find out the hard way that
> it was a Bad Idea (tm). They just wouldn't be told.
>
> Regards,
>
>    Nige

And yes, he is right: CRCs and hashes all have a nonzero probability of
saying that two files are identical when in fact they are not.
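
For what it's worth, here is a rough sketch of how one might still use a
hash as a cheap first pass over many files while leaving the final verdict
to a real byte-by-byte comparison. The function names (file_digest,
duplicate_pairs) and the choice of SHA-256 are my own assumptions for
illustration, not anything proposed earlier in the thread. The only
standard-library pieces relied on are hashlib and filecmp.cmp with
shallow=False, which compares file contents rather than os.stat()
signatures.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_pairs(paths):
    """Return (a, b) pairs of paths whose contents are byte-for-byte equal.

    Size and digest are used only to narrow down the candidates. A
    matching digest is treated as "probably equal", never as proof.
    """
    candidates = defaultdict(list)
    for p in paths:
        candidates[(os.path.getsize(p), file_digest(p))].append(p)

    pairs = []
    for group in candidates.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                # Final check: compare the actual bytes, so a hash
                # collision cannot yield a false "identical".
                if filecmp.cmp(a, b, shallow=False):
                    pairs.append((a, b))
    return pairs
```

If you only ever compare two files, skip the hashing and call
filecmp.cmp(a, b, shallow=False) directly: hashing both files reads every
byte anyway, so it buys nothing in that case.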


