a program to delete duplicate files

David Eppstein eppstein at ics.uci.edu
Mon Mar 14 13:43:23 EST 2005


In article <1110617519.413704.237830 at g14g2000cwa.googlegroups.com>,
 "John Machin" <sjmachin at lexicon.net> wrote:

> Just look at the efficiency of processing N files of the same size S,
> where they differ after d bytes: [If they don't differ, d = S]

I think this misses the point.  It's easy to find the files that are 
different.  Just a file size comparison will get most of them, and most 
of the rest can be detected in the first few kbytes.  So I think it's 
safe to assume both of these filtering mechanisms would be incorporated 
into a good duplicate detection code, and that any remaining tests that 
would be performed are between files that are very likely to be the same 
as each other.
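
For concreteness, here's a minimal sketch of those two cheap filters in 
Python.  The 8K prefix size and the names (candidate_groups, 
PREFIX_BYTES) are just illustrative choices on my part, not anything 
from the earlier posts:

import os
from collections import defaultdict

PREFIX_BYTES = 8192  # "first few kbytes"; the exact cutoff is a guess

def candidate_groups(paths):
    """Group paths by size, then by a short prefix, so that only files
    that survive both cheap filters are treated as likely duplicates."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_prefix = defaultdict(list)
        for p in same_size:
            with open(p, 'rb') as f:
                by_prefix[f.read(PREFIX_BYTES)].append(p)
        groups.extend(g for g in by_prefix.values() if len(g) > 1)
    return groups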

The hard part is verifying that the files that look like duplicates 
really are duplicates.  Doing so, for a group of m files that appear to 
be the same, requires 2(m-1) full-file reads if you use a 
comparison-based method, or m reads if you use a strong hashing method.  
You can't hope to cut the reads off early when using comparisons, 
because the files won't be different.
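
Concretely, the hashing side might look like the sketch below, which 
reads each of the m files in full exactly once.  SHA-256 and the name 
verify_by_hash are my own choices here, not something from the earlier 
posts:

import hashlib
from collections import defaultdict

def verify_by_hash(group, chunk_size=1 << 20):
    """Split a group of likely-equal files into sets of true duplicates,
    reading each file in full exactly once (m reads total)."""
    by_digest = defaultdict(list)
    for path in group:
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        by_digest[h.hexdigest()].append(path)
    return [dups for dups in by_digest.values() if len(dups) > 1]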

The question to me is: when you have a group of m>2 likely-to-be-equal 
files, is 2(m-1) reads and very little computation better than m reads 
and a strong hash computation?  I haven't yet seen an answer to this, 
but it's a question for experimentation rather than theory.
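
One way to run that experiment: time verify_by_hash above against a 
comparison-based counterpart like this one (again just a sketch; it 
assumes the size filter has already run, so a mismatch is the rare 
case):

def all_match_reference(group, chunk_size=1 << 20):
    """Comparison-based check: read the first file once per other file,
    which is 2(m-1) full reads when everything really does match."""
    reference, rest = group[0], group[1:]
    for other in rest:
        with open(reference, 'rb') as a, open(other, 'rb') as b:
            while True:
                ca, cb = a.read(chunk_size), b.read(chunk_size)
                if ca != cb:
                    return False  # a real difference; rare for this input
                if not ca:
                    break  # both files exhausted together
    return True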

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/


