md5 and large files

Tue Oct 19 14:32:07 EDT 2004

Tobias Pfeiffer wrote:
>>You may well search the space of
>>all possible files (in all file lengths), and never find a single
>>file that hashes the same as your original file.
> 
> 
> Are you sure about that? I have no clue about what the md5 algorithm 
> works like, but I'd think one could prove that with an number large 
> enough, every hash occurs twice. At last, md5 is not random.

I'm not sure. However, I'm also not sure of the contrary.

It is well possible to define a hash function which produces certain
hash values only for a single input. Consider this hash function
(operating on byte strings):

def hash(s):
   if len(s)==100:
     for c in s:
       if c != 'X': break
     else:
       return 0
   return len(s)+1

This hash function returns 0 only for a string of 100 Xs.
For any other input, it returns a different value.

So even though there are many collisions in the hash, there is
one input which never collides. The same may true for md5. Whether
it actually is true is still an open question, AFAIK. The proof
could go either way:
- here is string S which hashes as md5(S), and I can prove that
   no other string can possibly hash as md5(S).
- here is a procedure which, for every hash value H, produces two
   distinct strings S1 and S2 so that H==md5(S1)==md5(S2).

Regards,
Martin