md5 and large files
Peter L Hansen
peter at engcorp.com
Mon Oct 18 09:17:34 EDT 2004
Tobias Pfeiffer wrote:
> On 18 Oct 2004, Roger Binns wrote:
>>Brad Tilley wrote:
>>
>>>I would like to verify that the files are not corrupt so what's the
>>>most efficient way to calculate md5 sums on 4GB files? The machine
>>>doing the calculations is a small desktop with 256MB of RAM.
>>
>>If you need to be 100% certain, then only doing md5sum over the
>>entire file will work as Tim points out.
>
> This is not true. I'd say there are quite a lot of 2 GB files that
> produce the same md5 hash...

Without deliberately contriving an example using the recently
discovered technique, can you offer even a single example? ;-)

(If you were trying to point out that 100.00000000000% or whatever
is not possible with MD5, okay, but note that Roger didn't specify
the precision. 100% is close enough to what you'd get with MD5.)

> I think he has to see what he really wants to do with that file. If the
> goal is "compute the md5sum", then a loop with md5.update() seems most
> appropriate to me. If the goal is "check the equality" or "check whether
> they are corrupted", why md5? He can just read small blocks from the file
> and then do a simple string comparison. Might even be faster.

Simple string comparisons with *what*? Are you assuming that there
is a known-good copy of the file sitting right next to it, that he
can compare against?

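
For reference, the "loop with md5.update()" mentioned in the quoted text
could be sketched roughly as follows. This is only an illustration: it
assumes the hashlib module (which replaced the old md5 module in later
Python versions), and the function name md5_of_file and the 1 MB block
size are made up for the example. Memory use stays at about one block
regardless of file size, so a 4GB file is no problem on a 256MB machine.

```python
import hashlib


def md5_of_file(path, block_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it block by block.

    md5_of_file and block_size are hypothetical names for this sketch.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # empty bytes object means end of file
                break
            digest.update(block)  # feed each block into the running hash
    return digest.hexdigest()
```

The result can then be compared against a checksum published alongside
the file, if one is available.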
> And here,
> the chance is really close to 100% he'd notice a change in the files. :-)
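
If a known-good copy really is available, the block-wise comparison
suggested in the quoted text might look something like the sketch below.
It is only a sketch under that assumption; the function name files_equal
and the block size are invented for illustration and do not come from
the thread.

```python
import os


def files_equal(path_a, path_b, block_size=1024 * 1024):
    """Compare two regular files block by block without loading them whole.

    files_equal and block_size are hypothetical names for this sketch.
    """
    # Files of different sizes cannot be identical, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if block_a != block_b:
                return False  # first differing block ends the comparison
            if not block_a:
                return True  # both files reached end of file together
```

Unlike a checksum, this needs both files readable at the same time, which
is exactly the assumption questioned in the reply.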