binary file compare...

Adam Olsen rhamph at gmail.com
Fri Apr 17 00:44:42 EDT 2009


On Apr 16, 4:27 pm, "Rhodri James" <rho... at wildebst.demon.co.uk>
wrote:
> On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen <rha... at gmail.com> wrote:
> > On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
> >> Okay, before I tell you about the empirical, real-world evidence I have
> >> could you please accept that hashes collide and that no matter how many
> >> samples you use the probability of finding two files that do collide is
> >> small but not zero.
>
> > I'm afraid you will need to back up your claims with real files.
>
> So that would be a "no" then.  If the implementation of dicts in Python,
> say, were to assert as you are that the hashes aren't going to collide,
> then I'd have to walk away from it.  There's no point in using something
> that guarantees a non-zero chance of corrupting your data.

Python's hash is only 32 bits on a 32-bit box, so by the birthday bound
even 2**16 keys (about 65 thousand) give you a decent chance of a
collision.  In contrast, MD5 needs on the order of 2**64 items, and a
*good* hash needs 2**128 (SHA-256) or 2**256 (SHA-512).  The two are at
totally different extremes.
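
To make that concrete, here is a rough sketch of the birthday-bound
arithmetic (the function name is my own shorthand, not anything from the
stdlib):

import math

def collision_probability(n_items, hash_bits):
    # Birthday approximation: P(collision) ~= 1 - exp(-n**2 / 2**(bits+1))
    return 1 - math.exp(-n_items**2 / 2**(hash_bits + 1))

print(collision_probability(2**16, 32))    # ~0.39 for Python's 32-bit hash
print(collision_probability(2**16, 128))   # ~0.0 for an MD5-sized digest

At 2**16 keys a 32-bit hash already collides roughly four times in ten,
while a 128-bit digest doesn't even register.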

There is *always* a non-zero chance of corruption, due to software
bugs, hardware defects, or even operator error.  It is only in that
broader context that you can realize just how minuscule the risk is.

Can you explain to me why you go to such great lengths of paranoia over
hash collisions, when their risk is so much lower than those other
failure modes?


> Why are you advocating a solution to the OP's problem that is more
> computationally expensive than a simple byte-by-byte comparison and
> doesn't guarantee to give the correct answer?

For a single, one-off comparison I have no problem with a byte-by-byte
comparison.  There's a decent chance the files won't be in the OS's
cache anyway, so disk IO will be your bottleneck.
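
If you do go byte-by-byte, a minimal sketch looks something like the
following (or just use filecmp.cmp(a, b, shallow=False) from the stdlib);
reading in blocks keeps memory flat while the disk stays the bottleneck:

def files_equal(path_a, path_b, block_size=1 << 16):
    # Compare two files block by block; stop at the first difference.
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False
            if not block_a:          # both files ended at the same point
                return True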

Only if you're doing multiple comparisons is a hash database justified.
Even then, if you expect matching files to be fairly rare, I won't lose
any sleep if you're paranoid and confirm the matches byte-by-byte
anyway.  New vulnerabilities are found, and if you don't update your
hash function promptly there is a small (but significant) chance of a
maliciously crafted file producing a collision.
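
For the multiple-comparison case, the hash database amounts to something
like this sketch (the names are mine): group files by digest first, then
confirm the rare matching groups byte-by-byte if you're feeling paranoid.

import hashlib

def file_digest(path, block_size=1 << 16):
    # Hash a file in blocks so large files don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()

def find_duplicate_groups(paths):
    by_digest = {}
    for path in paths:
        by_digest.setdefault(file_digest(path), []).append(path)
    # Only the groups that share a digest need any further checking.
    return [group for group in by_digest.values() if len(group) > 1]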

That's not my concern though.  What I'm responding to is Nigel
Rantor's grossly incorrect statements about probability.  The chance
of collision, in our lifetime, is *insignificant*.

The Wayback Machine has 150 billion pages, so roughly 2**37.  Google's
index is a bit larger at over a trillion pages, so call it 2**40.  A
little closer to 2**64 than I'd like, but with MD5 that's still
562949950000000 to 1 odds against having *any* collisions between *any*
of the files.  Step up to SHA-256 and it becomes
191561940000000000000000000000000000000000000000000000 to 1.  Sadly, I
can't even give you the odds for SHA-512; Qalculate considers that too
close to infinite to display. :)
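
Those figures are just the birthday arithmetic again, and (rounding
aside) you can reproduce them in plain Python; the function below is my
shorthand for odds-to-one ~= 2**(bits+1) / n**2:

def odds_against_collision(n_items, hash_bits):
    # Approximate odds-to-one against *any* collision among n_items files.
    return 2**(hash_bits + 1) // n_items**2

print(odds_against_collision(2**40, 128))   # 562949953421312, the MD5 figure
print(odds_against_collision(2**40, 256))   # ~1.9e53, the SHA-256 figure
print(odds_against_collision(2**40, 512))   # Python happily prints this one

Unlike Qalculate, Python's integers have no trouble with the SHA-512
case, for whatever that's worth.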

You should worry more about your head spontaneously exploding than
about a hash collision on that scale.  To do otherwise is irrational
paranoia.


