sorting with expensive compares?
Paul Rubin
http
Fri Dec 23 16:39:17 EST 2005
Dan Stromberg <strombrg at dcs.nac.uci.edu> writes:
> I've been using the following compare function, which in short checks, in
> order:
>
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file
>
> (If #1 and #2 are identical, then the file must be a hardlink to the other
> file. Also, files that do not have the same length can never be
> identical. And of course these items are generated on demand, making it
> frequently possible to avoid doing full-file-length compare on a lot of
> files)
>
> However, my program is still doing more #6 comparisons than seems
Just don't do any #6 comparisons. If the hashes are identical, the
files are identical--that's the point of a cryptographic hash. OK, to
be pedantic:
1) there's a microscopic chance of an accidental collision, but
it's much lower than the chance of a hardware failure making your
comparison function give the wrong answer.
2) there are known cryptanalytic attacks against md5 that can let
someone deliberately construct two distinct files with the same
hash, with a fair amount of efort. If you care about this, use
sha-1 instead, or better yet, sha-256. (sha-1 has an attack
similar to the md5 attack, but the amount of work required is not
really practical today, unlike the md5 attack).
More information about the Python-list
mailing list