md5 and large files

Peter L Hansen peter at engcorp.com
Mon Oct 18 09:17:34 EDT 2004


Tobias Pfeiffer wrote:
> On 18 Oct 2004, Roger Binns wrote:
>>Brad Tilley wrote:
>>
>>>I would like to verify that the files are not corrupt so what's the
>>>most efficient way to calculate md5 sums on 4GB files? The machine
>>>doing the calculations is a small desktop with 256MB of RAM.
>>
>>If you need to be 100% certain, then only doing md5sum over the
>>entire file will work as Tim points out.
> 
> This is not true. I'd say there are quite a lot of 2 GB files that 
> produce the same md5 hash...

Without deliberately contriving an example using the recently
discovered technique, can you offer even a single example? ;-)

(If you were trying to point out that 100.00000000000% or whatever
is not possible with MD5, okay, but note that Roger didn't specify
the precision.  100% is close enough to what you'd get with MD5.)

> I think he has to see what he really wants to do with that file. If the 
> goal is "compute the md5sum", then a loop with md5.update() seems most 
> appropriate to me. If the goal is "check the equality" or "check whether 
> they are corrupted", why md5? He can just read small blocks from the file 
> and then do a simple string comparison. Might even be faster. 
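
Agreed on the first point, for what it's worth: for just computing
the checksum, a loop feeding fixed-size blocks to md5.update() keeps
memory use small no matter how large the file is.  A rough sketch
(the names and the 64KB block size here are arbitrary choices):

import md5

def file_md5(path, blocksize=65536):
    # Read the file in fixed-size blocks so only one block is ever
    # held in memory, regardless of the file's size.
    digest = md5.new()
    f = open(path, 'rb')
    try:
        while True:
            block = f.read(blocksize)
            if not block:        # empty string means end of file
                break
            digest.update(block)
    finally:
        f.close()
    return digest.hexdigest()

That walks a 4GB file through 64KB at a time, so 256MB of RAM is
plenty.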

Simple string comparisons with *what*?  Are you assuming there is a
known-good copy of the file sitting right next to it that he can
compare against?

> And here,
> the chance is really close to 100% he'd notice a change in the files. :-)
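
True, and if there *were* a known-good copy available to compare
against, the block-by-block comparison would indeed be about as
simple as the md5 loop above -- roughly (names again arbitrary):

def files_equal(path1, path2, blocksize=65536):
    # Compare two files a block at a time; stops at the first
    # difference instead of reading both files to the end.
    f1 = open(path1, 'rb')
    f2 = open(path2, 'rb')
    try:
        while True:
            b1 = f1.read(blocksize)
            b2 = f2.read(blocksize)
            if b1 != b2:
                return False
            if not b1:           # both files ended at the same point
                return True
    finally:
        f1.close()
        f2.close()

But the original question didn't mention having such a copy handy,
which is why the md5sum approach makes sense in the first place.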


