a program to delete duplicate files

Scott David Daniels Scott.Daniels at Acm.Org
Sat Mar 12 09:50:16 EST 2005


Patrick Useldinger wrote:
> Just to explain why I appear to be a lawyer: everybody I spoke to about
> this program told me to use hashes, but nobody has been able to explain
> why. I came up with two possible reasons myself:
> 
> 1) it's easier to program: you don't compare several files in parallel,
> but process them one by one. But it's not perfect: you still need to
> compare afterwards. In the worst case, you end up with 3 files with
> identical hashes, of which 2 are identical and 1 is not. To find
> this, you'd still have to implement the algorithm I use, unless you
> say "oh well, there's a problem with the hash, go and look yourself."
> 
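A minimal sketch of the hash-then-verify flow described in (1), for
reference; this is standard-library Python, and the function names are
mine, not Patrick's:

    import hashlib
    import os
    from collections import defaultdict

    def sha1_of(path, blocksize=65536):
        # One full read per file; the digest stands in for the contents.
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(blocksize), b''):
                h.update(block)
        return h.hexdigest()

    def same_bytes(a, b, blocksize=65536):
        # The byte-by-byte fallback: equal hashes do not *prove*
        # equal contents, so verify before deleting anything.
        with open(a, 'rb') as fa, open(b, 'rb') as fb:
            while True:
                ba = fa.read(blocksize)
                bb = fb.read(blocksize)
                if ba != bb:
                    return False
                if not ba:          # both files hit EOF together
                    return True

    def duplicate_groups(paths):
        # Cheap filter first: only same-size files can be equal.
        by_size = defaultdict(list)
        for p in paths:
            by_size[os.path.getsize(p)].append(p)
        groups = []
        for size, group in by_size.items():
            if len(group) < 2:
                continue
            # Hash only files that share a size; one read each.
            by_hash = defaultdict(list)
            for p in group:
                by_hash[sha1_of(p)].append(p)
            groups.extend(g for g in by_hash.values() if len(g) > 1)
        return groups

Any group that duplicate_groups() returns would still be confirmed
pairwise with same_bytes() before deleting anything, which is exactly
the "compare afterwards" step above.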
> 2) it's probably useful if you compare files over a network and you want 
> to reduce bandwidth. A hash lets you do that at the cost of local CPU 
> and disk usage, which may be OK. That was not my case.
> 
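As for (2), hashing remotely means only the fixed-size digest crosses
the wire instead of the file itself. A sketch, assuming ssh access and
a sha1sum binary on the remote host (my assumptions, not Patrick's
setup):

    import hashlib
    import subprocess

    def local_sha1(path):
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(65536), b''):
                h.update(block)
        return h.hexdigest()

    def remote_sha1(host, path):
        # Hashing happens on the remote machine; only the
        # 40-character digest is transferred, not the file.
        out = subprocess.run(['ssh', host, 'sha1sum', '--', path],
                             capture_output=True, text=True, check=True)
        return out.stdout.split()[0]

    def probably_same(local_path, host, remote_path):
        return local_sha1(local_path) == remote_sha1(host, remote_path)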
3) Probability is on your side.  If two files match in length but differ
    in contents, a good hash will match with probability only 1/max_hash
    (for a 128-bit hash such as MD5, 1 in 2**128).  For exactly two
    files of the same length, calculating the hash is clearly a waste.
    Even for three files of the same length, you are most likely to
    find them distinct in three comparisons.  Using hashes, that costs
    three file reads and three comparisons of hash values.  Without
    hashes, it costs six file reads: you must read both files to do a
    direct comparison, so three comparisons mean six reads.  Naturally,
    as the number of files grows beyond three, the difference grows
    drastically.
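To put rough numbers on that: hashing n same-length files costs n full
reads, while pairwise comparison without caching costs two reads for
each of the n*(n-1)/2 pairs (worst case, all contents distinct):

    def reads_with_hashes(n):
        return n                # one full read per file

    def reads_pairwise(n):
        return n * (n - 1)      # n*(n-1)/2 pairs, two reads each

    for n in (2, 3, 5, 10):
        print(n, reads_with_hashes(n), reads_pairwise(n))
    # n=2:   2 vs  2   (hashing buys nothing, as noted above)
    # n=3:   3 vs  6
    # n=5:   5 vs 20
    # n=10: 10 vs 90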

-Scott David Daniels
Scott.Daniels at Acm.Org


