binary file compare...

Adam Olsen rhamph at gmail.com
Fri Apr 17 14:19:31 EDT 2009


On Apr 17, 5:30 am, Tim Wintle <tim.win... at teamrubber.com> wrote:
> On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote:
> > The Wayback Machine has 150 billion pages, so 2**37.  Google's index
> > is a bit larger at over a trillion pages, so 2**40.  A little closer
> > than I'd like, but that's still 562949950000000 to 1 odds of having
> > *any* collisions between *any* of the files.  Step up to SHA-256 and
> > it becomes 191561940000000000000000000000000000000000000000000000 to
> > 1.  Sadly, I can't even give you the odds for SHA-512, Qalculate
> > considers that too close to infinite to display. :)
>
> That might be true as long as your data is completely uniformly
> distributed. For the example you give there's:
>
> a) a high chance that there's "<html>" near the top
>
> b) a non-uniform distribution of individual words within the text.
>
> c) a non-uniform distribution of all n-grams within the text (as there
> is in natural language)
>
> So it's very far from uniformly distributed. Just about the only
> situation where I could imagine that holding would be where you are
> hashing uniformly random data for the sake of testing the hash.
>
> I believe the point being made is that comparing hash values is a
> probabilistic algorithm anyway, which is fine if you're ok with that,
> but for mission critical software it's crazy.

Actually, *cryptographic* hashes handle non-uniform input just fine.
Their output is uniformly distributed no matter how skewed the input
is: even two files differing by a single bit produce completely
unrelated digests, with about half the output bits flipping on
average.  This is known as the Avalanche Effect.  A hash without it
would be vulnerable to attacks.
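
To see the Avalanche Effect for yourself, here's a quick sketch using
hashlib (the input string and the choice of which bit to flip are
arbitrary):

import hashlib

def flip_bit(data, bit):
    # Return a copy of `data` with a single bit flipped.
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)

a = b"The quick brown fox jumps over the lazy dog"
b = flip_bit(a, 0)  # differs from `a` in exactly one bit

da = hashlib.sha256(a).digest()
db = hashlib.sha256(b).digest()

# A good hash flips roughly half the output bits (~128 of 256).
diff = sum(bin(x ^ y).count("1") for x, y in zip(da, db))
print("output bits changed: %d of 256" % diff)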

Which isn't to say you couldn't *construct* inputs it would be
vulnerable to.  Finding them is pretty much the whole point of
attacking a cryptographic hash.  MD5 has significant known collision
vulnerabilities by now, and other hashes will in the future.  That's
just a risk you need to manage.
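
For the record, the odds quoted above follow from the standard
birthday bound: with n items and a b-bit hash, the chance of any
collision at all is roughly n**2 / 2**(b+1).  A quick sanity check in
Python (assuming the first figure was computed for a 128-bit hash
like MD5):

def collision_odds(n_items, hash_bits):
    # Approximate "1 in N" odds of any collision among n_items
    # random hashes (birthday bound: p ~= n**2 / 2**(bits+1)).
    return 2.0 ** (hash_bits + 1) / float(n_items) ** 2

print(collision_odds(2 ** 40, 128))  # ~5.6e14, matches 562949950000000
print(collision_odds(2 ** 40, 256))  # ~1.9e53, matches the SHA-256 figure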


