a program to delete duplicate files

John J. Lee jjl at pobox.com
Mon Mar 14 05:50:12 EST 2005


Patrick Useldinger <pu.news.001 at gmail.com> writes:

> David Eppstein wrote:
> 
> > The hard part is verifying that the files that look like duplicates
> > really are duplicates.  To do so, for a group of m files that appear
> > to be the same, requires 2(m-1) reads through the whole files if you
> > use a comparison based method, or m reads if you use a strong
> > hashing method.  You can't hope to cut the reads off early when
> > using comparisons, because the files won't be different.
> 
> If you read them in parallel, it's _at most_ m (m is the worst case
> here), not 2(m-1). In my tests, it has always been significantly
> less than m.

Hmm, Patrick's right, David, isn't he?

Except when m gets really BIG (say, M), in which case I suppose m is
no longer the worst-case number of whole-file reads (with that many
files you can't stream them all in parallel, so some would have to be
re-read).
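For what it's worth, here's a rough sketch (mine, not Patrick's actual
code) of the comparison-based approach as I understand it: read a group
of same-sized candidate files in parallel, one chunk at a time, and
split the group whenever the chunks disagree.  The chunk size is
arbitrary:

CHUNK_SIZE = 64 * 1024  # arbitrary

def find_duplicate_groups(paths):
    """Given files assumed to be the same size, return lists of paths
    whose contents really are identical."""
    start = [(p, open(p, 'rb')) for p in paths]
    pending = [start]            # candidate groups still being compared
    confirmed = []
    while pending:
        group = pending.pop()
        if len(group) < 2:       # nothing left to compare against
            group[0][1].close()
            continue
        chunks = [(path, fh, fh.read(CHUNK_SIZE)) for path, fh in group]
        if all(not data for _, _, data in chunks):
            # EOF everywhere: every survivor in this group matched
            confirmed.append([path for path, fh, data in chunks])
            for path, fh, data in chunks:
                fh.close()
            continue
        # Partition by chunk contents; files that diverge here are never
        # read or compared against the other partitions again.
        by_chunk = {}
        for path, fh, data in chunks:
            by_chunk.setdefault(data, []).append((path, fh))
        pending.extend(by_chunk.values())
    return confirmed

Note that this keeps one open handle and one chunk in memory per file
in the group, which is presumably where a really big m starts to hurt.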

And I'm not sure what the trade-off between disk seeks and disk reads
does to the problem in practice (with caching and realistic memory
constraints).
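For comparison, the strong-hashing method David describes comes down to
exactly one full read per file, however the group shakes out.  A sketch,
with sha1 standing in for whichever hash you trust (hashlib is the
module name on recent Pythons; older ones have the sha module instead):

import hashlib

def digest(path, chunk_size=64 * 1024):
    """Read the whole file once and return a strong hash of it."""
    h = hashlib.sha1()
    f = open(path, 'rb')
    try:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            h.update(data)
    finally:
        f.close()
    return h.digest()

def find_duplicate_groups_by_hash(paths):
    """Group files by digest; same digest is taken to mean same contents."""
    by_digest = {}
    for path in paths:
        by_digest.setdefault(digest(path), []).append(path)
    return [group for group in by_digest.values() if len(group) > 1]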


John


