a program to delete duplicate files

Bengt Richter bokr at oz.net
Mon Mar 14 15:42:27 EST 2005


On Mon, 14 Mar 2005 10:43:23 -0800, David Eppstein <eppstein at ics.uci.edu> wrote:

>In article <1110617519.413704.237830 at g14g2000cwa.googlegroups.com>,
> "John Machin" <sjmachin at lexicon.net> wrote:
>
>> Just look at the efficiency of processing N files of the same size S,
>> where they differ after d bytes: [If they don't differ, d = S]
>
>I think this misses the point.  It's easy to find the files that are 
>different.  Just a file size comparison will get most of them, and most 
>of the rest can be detected in the first few kbytes.  So I think it's 
>safe to assume both of these filtering mechanisms would be incorporated 
>into a good duplicate detection code, and that any remaining tests that 
>would be performed are between files that are very likely to be the same 
>as each other.
>
>The hard part is verifying that the files that look like duplicates 
>really are duplicates.  To do so, for a group of m files that appear to 
>be the same, requires 2(m-1) reads through the whole files if you use a 
>comparison based method, or m reads if you use a strong hashing method.
What do you mean by "a comparison-based method"? Are you excluding the
possibility of parallel reading and comparing? The problem then becomes
verifying that m buffers containing corresponding segments of the files
are equal. You could do that by comparing hashes of the buffers (or updated
running hashes for the whole files read so far), or you could compare the
buffers in parallel byte by byte if that turns out to be efficient to implement
in the given language and OS environment. Small buffers would probably
not be efficient for disk access, but too large might not be good either.
Maybe some automatic tuning algorithm could be developed to optimize
comparison operations in the shadow of I/O. But does the OS accept hints
about read-ahead buffering? Are the disks SCSI RAID, external USB, or
networked? Can automatic tuning achieve an optimum regardless of all that?
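
To make the parallel idea concrete, here is a minimal sketch (mine, untested
against real disks, and the 64 KiB buffer size is an arbitrary guess -- which
is exactly the tuning question above): read a group of candidate files in
lockstep, one buffer at a time, and split the group as soon as buffers disagree.

def split_equal_groups(paths, bufsize=64 * 1024):
    """Partition paths into groups whose contents are byte-for-byte equal."""
    groups = [[(p, open(p, 'rb')) for p in paths]]   # start with one candidate group
    result = []
    while groups:
        group = groups.pop()
        if len(group) == 1:
            result.append([group[0][0]])
            group[0][1].close()
            continue
        # read the next chunk of every file in the group "in parallel"
        chunks = {}
        for path, fh in group:
            chunks.setdefault(fh.read(bufsize), []).append((path, fh))
        for chunk, members in chunks.items():
            if chunk == b'':    # all members hit EOF together: these files are equal
                result.append([path for path, fh in members])
                for _, fh in members:
                    fh.close()
            else:
                groups.append(members)   # keep comparing this sub-group
    return result

Equal files get read once each, to the end; files that differ stop being read at
the first differing buffer -- the early cut-off that pre-hashing gives up.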

What is it that we're trying to optimize? Time to classify the
whole file set into groups of equivalent files? (note that there can be
multiple groups that are internally equal but unequal to members of other groups).
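
A sketch of that classification (again mine; the names and the 8 KiB prefix
length are illustrative choices, not anything from the thread): group by size,
then by the first few kilobytes as suggested above, and only run the full
parallel comparison sketched earlier on whatever survives both filters.

import os
from collections import defaultdict

def classify(paths, prefix_len=8 * 1024):
    """Split the whole file set into groups of byte-for-byte equal files."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    classes = []
    for same_size in by_size.values():
        if len(same_size) == 1:          # unique size: cannot have a duplicate
            classes.append(same_size)
            continue
        by_prefix = defaultdict(list)    # cheap second filter: first few KiB
        for p in same_size:
            with open(p, 'rb') as fh:
                by_prefix[fh.read(prefix_len)].append(p)
        for candidates in by_prefix.values():
            if len(candidates) == 1:
                classes.append(candidates)
            else:                        # full parallel comparison, as sketched above
                classes.extend(split_equal_groups(candidates))
    return classes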

The limit on the number of simultaneously open files could be virtualized away,
so that it only has an impact when it is actually encountered.
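
One way to do that virtualization (a sketch, not anything from the thread):
give each file a small reader object that remembers its own offset, and have
all readers share a bounded LRU pool of real descriptors, reopening and seeking
only when a descriptor has been evicted. Something like this could stand in for
the plain file objects in the comparison sketch above, since both only need read().

import collections

class LazyFile:
    """Reader that shares a bounded pool of real file descriptors."""
    MAX_OPEN = 100                        # illustrative limit, below the OS's
    _pool = collections.OrderedDict()     # path -> open file object, in LRU order

    def __init__(self, path):
        self.path = path
        self.offset = 0

    def read(self, size):
        fh = self._pool.pop(self.path, None)
        if fh is None:                    # not currently open: evict the oldest if full
            if len(self._pool) >= self.MAX_OPEN:
                _, oldest = self._pool.popitem(last=False)
                oldest.close()
            fh = open(self.path, 'rb')
        self._pool[self.path] = fh        # mark as most recently used
        fh.seek(self.offset)
        data = fh.read(size)
        self.offset += len(data)
        return data

    def close(self):
        fh = self._pool.pop(self.path, None)
        if fh is not None:
            fh.close()

Until the pool limit is reached nothing is ever reopened, so the descriptor
limit only costs anything when it is actually hit.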
  
>You can't hope to cut the reads off early when using comparisons, 
>because the files won't be different.
So the worst case is reading every file through to the end once, in parallel.
Pre-hashing, by contrast, guarantees that worst case for all files -- unless disk
access patterns happen to make a single sequential pass the most efficient way to
get everything read and compared when most files are equal.
>
>The question to me is: when you have a group of m>2 likely-to-be-equal 
>files, is 2(m-1) reads and very little computation better than m reads 
>and a strong hash computation?  I haven't yet seen an answer to this, 
>but it's a question for experimentation rather than theory.
ISTM that 2(m-1) reads are not necessary, so "the question" to me doesn't involve that ;-)

Regards,
Bengt Richter
