md5 and large files

"Martin v. Löwis" martin at v.loewis.de
Sun Oct 17 13:04:00 EDT 2004


Brad Tilley wrote:
> Is reading the first 4096 bytes of the files and calculating the md5 sum 
> based on that sufficient for uniquely identifying the files or am I 
> going about this totally wrong? Any advice or ideas appreciated.

Clearly, you need to use the same procedure for later verification. The
usual approach is to compute the md5sum for the entire file.
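
A minimal sketch of the whole-file approach, assuming Python's hashlib module (the post itself does not name a module). The function name md5_of_file and the 64 KiB chunk size are illustrative choices, not something from the original question. Reading in chunks keeps memory use flat even for very large files:

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the entire file at `path`.

    `chunk_size` is an arbitrary illustrative value; any positive
    size yields the same digest.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns b"" at end of file,
        # so the whole file is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The same function has to be used both when recording the checksum and when verifying it later, otherwise the two digests are not comparable.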

Whether this is sufficient somewhat depends on what you want to achieve:
- uniquely identify the file: this works reliably if there is some
   guarantee that no two such files will be identical within the first
   4096 bytes. If your files are, say, log files with different starting
   dates, and the log file lines contain the starting dates, this is a
   safe assumption. If these are different versions of essentially the
   same file (e.g. different compilations of the same source code), I
   would not bet that different files already differ within the first
   4096 bytes.

- verify that the file has not been corrupted, tampered with, or modified.
   Your approach is clearly insufficient, as it can only detect
   modifications within the first 4096 bytes.
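
The second point can be illustrated with a small sketch, again assuming Python's hashlib module. The helper names md5_prefix and md5_full and the sample data are hypothetical and made up for the illustration. A change at byte 5000 leaves the 4096-byte prefix digest untouched, while the whole-file digest changes:

```python
import hashlib

PREFIX = 4096  # number of leading bytes hashed in the original question

def md5_prefix(data, n=PREFIX):
    """Hex MD5 digest of only the first n bytes of data."""
    return hashlib.md5(data[:n]).hexdigest()

def md5_full(data):
    """Hex MD5 digest of all of data."""
    return hashlib.md5(data).hexdigest()

# Made-up sample contents: 8192 identical bytes, then the same contents
# with a single byte changed at offset 5000, past the hashed prefix.
original = b"A" * 8192
modified = original[:5000] + b"B" + original[5001:]

# The prefix digests match, so the modification goes unnoticed ...
same_prefix = md5_prefix(original) == md5_prefix(modified)
# ... while the whole-file digests differ and expose the change.
same_full = md5_full(original) == md5_full(modified)
```

The same reasoning applies to the identification case: two different files that happen to share their first 4096 bytes get the same prefix digest.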

Regards,
Martin


