md5 and large files

Jeff Epler jepler at unpythonic.net
Sun Oct 17 12:55:06 EDT 2004


It seems quite likely that two different files would share the same 4k "preamble".

For instance, a Unix tar file containing a 16k "file1" and then a 1k
"file2" would have the same leading bytes as a Unix tar file containing
a 16k "file1" and a 1k "file3", and therefore the md5sum over the first
4k would match. (These two tar files would also have the same byte
length, so comparing sizes would not tell them apart either.)
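
Here is a rough sketch of the tar case, in case anyone wants to try it.
It assumes a current Python with hashlib and builds the archives in
memory; make_tar is just a name I made up for the illustration, and the
file contents are dummy bytes.

```python
import hashlib
import io
import tarfile

def make_tar(second_name):
    """Build an in-memory tar holding a 16k "file1" and a 1k second member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, size in (("file1", 16 * 1024), (second_name, 1024)):
            # TarInfo defaults (mtime=0, uid=0, ...) keep the headers identical.
            info = tarfile.TarInfo(name)
            info.size = size
            tar.addfile(info, io.BytesIO(b"x" * size))
    return buf.getvalue()

tar_a = make_tar("file2")
tar_b = make_tar("file3")

# Same length and same first 4k, so a size check plus a 4k md5 cannot
# tell the two archives apart.
same_length = len(tar_a) == len(tar_b)
same_prefix_md5 = (hashlib.md5(tar_a[:4096]).digest()
                   == hashlib.md5(tar_b[:4096]).digest())

# The md5 over the whole archive does differ, because the second
# member's header carries a different name.
same_full_md5 = hashlib.md5(tar_a).digest() == hashlib.md5(tar_b).digest()
```

The only difference between the two archives is the member name stored
past the 16k of "file1" data, which a 4k prefix never reaches.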

If all pages on some website begin with
    <HTML>
        <HEAD>
        <SCRIPT> pages and pages of javascript here (at least 4k) </SCRIPT>
        <TITLE> ...
the initial 4k might match, too.

But anyway, if s1 != s2, then the odds that hash(s1) == hash(s2) should
be small, and that shouldn't depend on the length of the string.
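
Hashing the whole file need not mean reading it all into memory at once.
A minimal sketch, assuming hashlib is available; md5_of_file and the
64k chunk size are my own choices for illustration, not anything from
the original poster's code:

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 of the whole file, reading chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

# Two temporary files that share their first 4k and differ only in
# the very last byte.
common = b"A" * 4096
paths = []
for tail in (b"1", b"2"):
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(common + b"B" * 100000 + tail)
    paths.append(path)

full_digests = [md5_of_file(p) for p in paths]

# Clean up the temporary files.
for p in paths:
    os.remove(p)
```

Memory use stays at one chunk regardless of file size, and the two
digests differ even though the first 4k of the files is identical.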

Jeff