a program to delete duplicate files

Patrick Useldinger pu.news.001 at gmail.com
Mon Mar 14 14:23:52 EST 2005


David Eppstein wrote:

> The hard part is verifying that the files that look like duplicates 
> really are duplicates.  To do so, for a group of m files that appear to 
> be the same, requires 2(m-1) reads through the whole files if you use a 
> comparison based method, or m reads if you use a strong hashing method.  
> You can't hope to cut the reads off early when using comparisons, 
> because the files won't be different.

If you read them in parallel, it's _at most_ m reads (m is the worst 
case here), not 2(m-1). In my tests, it has always been significantly 
less than m, because a candidate drops out as soon as one of its chunks 
differs from the others.
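The parallel idea can be sketched roughly like this (a hypothetical 
`find_duplicates` helper, not the actual program under discussion): read 
one chunk from every candidate, bucket the files by chunk content, and 
stop reading any file whose bucket has shrunk to a single member.

```python
def find_duplicates(paths, chunk_size=65536):
    """Compare candidate files chunk-by-chunk in parallel; stop reading
    a file as soon as it no longer matches any other candidate."""
    handles = [(p, open(p, "rb")) for p in paths]
    try:
        groups = [handles]   # start with one group of all candidates
        result = []
        while groups:
            next_groups = []
            for group in groups:
                # read the next chunk from every file still in this group
                buckets = {}
                for path, fh in group:
                    chunk = fh.read(chunk_size)
                    buckets.setdefault(chunk, []).append((path, fh))
                for chunk, members in buckets.items():
                    if len(members) < 2:
                        continue          # unique so far: stop reading it
                    if chunk == b"":
                        # all members hit EOF together: true duplicates
                        result.append([p for p, _ in members])
                    else:
                        next_groups.append(members)
            groups = next_groups
        return result
    finally:
        for _, fh in handles:
            fh.close()
```

So each file is read at most once in full, and files that differ early 
are read only up to the first differing chunk, which is why the observed 
cost stays below m full reads.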

-pu
