md5 and large files

Brad Tilley rtilley at vt.edu
Sun Oct 17 13:22:26 EDT 2004


Martin v. Löwis wrote:
> Brad Tilley wrote:
> 
>> Is reading the first 4096 bytes of the files and calculating the md5 
>> sum based on that sufficient for uniquely identifying the files or am 
>> I going about this totally wrong? Any advice or ideas appreciated.
> 
> 
> Clearly, you need to use the same procedure for later verification. The
> usual approach is to compute the md5sum for the entire file.
> 
> Whether this is sufficient somewhat depends on what you want to achieve:
> - uniquely identify the file: this works reliably if there is some
>   guarantee that no two such files will be identical within the first
>   4096 bytes. If your files are, say, log files with different starting
>   dates, and the log file lines contain the starting dates, this is a
>   safe assumption. If these are different versions of essentially the
>   same file (e.g. different compilations of the same source code), I
>   would not bet that different files already differ within the first
>   4096 bytes.
> 
> - verify that the file is not corrupted, tampered with, or modified.
>   Your approach is clearly insufficient, as it can only detect
>   modifications within the first 4096 bytes.

I would like to verify that the files are not corrupt. What is the most
efficient way to calculate md5 sums on 4GB files? The machine doing the
calculations is a small desktop with 256MB of RAM.
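
For reference, one common way to hash the entire file without holding it
in memory is to read it in fixed-size chunks and feed each chunk to the
hash object. The following is only a minimal sketch, assuming a modern
Python with the hashlib module (the md5 module available in 2004 exposes
a similar update()/hexdigest() interface). The function name md5_of_file
and the 1 MiB chunk size are illustrative choices, not anything from the
thread.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the whole file at `path`.

    At most `chunk_size` bytes are read per call, so memory use stays
    roughly constant regardless of the file size.
    """
    digest = hashlib.md5()
    # Binary mode so the exact bytes on disk are hashed.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object signals end of file
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file("some_large_file") (a hypothetical path) should give
the same digest as hashing the whole file at once, since update() on
successive chunks is equivalent to one update() on their concatenation.
The chunk size mainly trades memory for the number of read calls; for a
file this large the run time is likely dominated by disk reads.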


