md5 and large files
Brad Tilley
rtilley at vt.edu
Sun Oct 17 13:22:26 EDT 2004
Martin v. Löwis wrote:
> Brad Tilley wrote:
>
>> Is reading the first 4096 bytes of the files and calculating the md5
>> sum based on that sufficient for uniquely identifying the files or am
>> I going about this totally wrong? Any advice or ideas appreciated.
>
>
> Clearly, you need to use the same procedure for later verification. The
> usual approach is to compute the md5sum for the entire file.
>
> Whether this is sufficient somewhat depends on what you want to achieve:
> - uniquely identify the file: this works reliably if there is some
> guarantee that no two such files will be identical within the first
> 4096 bytes. If your files are, say, log files with different starting
> dates, and the log file lines contain the starting dates, this is a
> safe assumption. If these are different versions of essentially the
> same file (e.g. different compilations of the same source code), I
> would not bet that different files already differ within the first
> 4096 bytes.
>
> - verify that the file is not corrupted, tampered with, or modified.
> Your approach is clearly insufficient, as it can only detect
> modifications within the first 4096 bytes.
I would like to verify that the files are not corrupt. What is the most
efficient way to calculate md5 sums on 4GB files? The machine doing the
calculations is a small desktop with 256MB of RAM.
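
For reference, one common way to do this is to feed the file to the hash
object in fixed-size chunks, so that memory use stays near the chunk size
no matter how large the file is. The sketch below is only an illustration
under some assumptions: it uses the standard hashlib module (older Python
versions provide a separate md5 module instead), and the function name
md5_of_file and the 1 MB default chunk size are arbitrary choices, not
anything prescribed in this thread.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the whole file at `path`.

    The file is read `chunk_size` bytes at a time (1 MB by default, an
    assumed value), so only one chunk is held in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:  # binary mode: hash the raw bytes
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file("some_large_file") (a placeholder path) should give the
same digest as hashing the entire contents in one call, because successive
update() calls are equivalent to a single update with the concatenated
data. Since the whole file is read, a change anywhere in it alters the
digest, not just a change within the first 4096 bytes.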