md5 and large files
"Martin v. Löwis"
martin at v.loewis.de
Sun Oct 17 13:04:00 EDT 2004
Brad Tilley wrote:
> Is reading the first 4096 bytes of the files and calculating the md5 sum
> based on that sufficient for uniquely identifying the files or am I
> going about this totally wrong? Any advice or ideas appreciated.
Clearly, you need to use the same procedure for later verification. The
usual approach is to compute the md5sum for the entire file.
Whether hashing only the first 4096 bytes is sufficient depends on what
you want to achieve:
- uniquely identify the file: this works reliably if there is some
guarantee that no two such files will be identical within the first
4096 bytes. If your files are, say, log files with different starting
dates, and the log file lines contain the starting dates, this is a
safe assumption. If these are different versions of essentially the
same file (e.g. different compilations of the same source code), I
would not bet that different files already differ within the first
4096 bytes.
- verify that the file is not corrupted, tampered with, or modified.
Your approach is clearly insufficient, as it can only detect
modifications within the first 4096 bytes.
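To make the difference concrete, here is a minimal sketch of both
procedures. It assumes a Python version that provides hashlib; the
function names md5_of_file and md5_of_prefix and the 64 KB chunk size
are illustrative choices, not anything from the original question.
Reading in chunks keeps memory use constant even for very large files.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the entire file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # f.read() returns b"" at end of file, which stops the loop.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_of_prefix(path, length=4096):
    """Return the hex MD5 digest of only the first `length` bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        digest.update(f.read(length))
    return digest.hexdigest()
```

Two files that agree in their first 4096 bytes get the same value from
md5_of_prefix no matter what follows, while md5_of_file will differ as
soon as any byte differs (barring an MD5 collision).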
Regards,
Martin