a program to delete duplicate files

John J. Lee jjl at pobox.com
Mon Mar 14 05:50:12 EST 2005


Patrick Useldinger <pu.news.001 at gmail.com> writes:

> David Eppstein wrote:
> 
> > The hard part is verifying that the files that look like duplicates
> > really are duplicates.  To do so, for a group of m files that appear
> > to be the same, requires 2(m-1) reads through the whole files if you
> > use a comparison based method, or m reads if you use a strong
> > hashing method.  You can't hope to cut the reads off early when
> > using comparisons, because the files won't be different.
> 
> If you read them in parallel, it's _at most_ m (m is the worst case
> here), not 2(m-1). In my tests, it has always been significantly
> less than m.

Hmm, Patrick's right, David, isn't he?

Except when m gets really BIG (say, M), in which case I suppose m is
no longer the worst-case number of whole-file reads (with that many
files you can't stream them all in parallel, so some would have to be
re-read).
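For what it's worth, here's a rough sketch (mine, not Patrick's actual
code) of the comparison-based approach as I understand it: read a group
of same-sized candidate files in parallel, one chunk at a time, and
split the group whenever the chunks disagree.  The chunk size is
arbitrary:

CHUNK_SIZE = 64 * 1024  # arbitrary

def find_duplicate_groups(paths):
    """Given files assumed to be the same size, return lists of paths
    whose contents really are identical."""
    start = [(p, open(p, 'rb')) for p in paths]
    pending = [start]            # candidate groups still being compared
    confirmed = []
    while pending:
        group = pending.pop()
        if len(group) < 2:       # nothing left to compare against
            group[0][1].close()
            continue
        chunks = [(path, fh, fh.read(CHUNK_SIZE)) for path, fh in group]
        if all(not data for _, _, data in chunks):
            # EOF everywhere: every survivor in this group matched
            confirmed.append([path for path, fh, data in chunks])
            for path, fh, data in chunks:
                fh.close()
            continue
        # Partition by chunk contents; files that diverge here are never
        # read or compared against the other partitions again.
        by_chunk = {}
        for path, fh, data in chunks:
            by_chunk.setdefault(data, []).append((path, fh))
        pending.extend(by_chunk.values())
    return confirmed

Note that this keeps one open handle and one chunk in memory per file
in the group, which is presumably where a really big m starts to hurt.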

And I'm not sure what the trade-off between disk seeks and disk reads
does to the problem in practice (with caching and realistic memory
constraints).
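For comparison, the strong-hashing method David describes comes down to
exactly one full read per file, however the group shakes out.  A sketch,
with sha1 standing in for whichever hash you trust (hashlib is the
module name on recent Pythons; older ones have the sha module instead):

import hashlib

def digest(path, chunk_size=64 * 1024):
    """Read the whole file once and return a strong hash of it."""
    h = hashlib.sha1()
    f = open(path, 'rb')
    try:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            h.update(data)
    finally:
        f.close()
    return h.digest()

def find_duplicate_groups_by_hash(paths):
    """Group files by digest; same digest is taken to mean same contents."""
    by_digest = {}
    for path in paths:
        by_digest.setdefault(digest(path), []).append(path)
    return [group for group in by_digest.values() if len(group) > 1]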


John


