sorting with expensive compares?

Fri Dec 23 16:39:17 EST 2005

Dan Stromberg <strombrg at dcs.nac.uci.edu> writes:
> I've been using the following compare function, which in short checks, in
> order:
> 
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file
> 
> (If #1 and #2 are identical, then the file must be a hardlink to the other
> file.  Also, files that do not have the same length can never be
> identical.  And of course these items are generated on demand, making it
> frequently possible to avoid doing full-file-length compare on a lot of
> files)
> 
> However, my program is still doing more #6 comparisons than seems

Just don't do any #6 comparisons.  If the hashes are identical, the
files are identical--that's the point of a cryptographic hash.  OK, to
be pedantic:

   1) there's a microscopic chance of an accidental collision, but
   it's much lower than the chance of a hardware failure making your
   comparison function give the wrong answer.

   2) there are known cryptanalytic attacks against md5 that can let
   someone deliberately construct two distinct files with the same
   hash, with a fair amount of efort.  If you care about this, use
   sha-1 instead, or better yet, sha-256.  (sha-1 has an attack
   similar to the md5 attack, but the amount of work required is not
   really practical today, unlike the md5 attack).