a program to delete duplicate files
Patrick Useldinger
pu.news.001 at gmail.com
Mon Mar 14 14:23:52 EST 2005
David Eppstein wrote:
> The hard part is verifying that the files that look like duplicates
> really are duplicates. To do so, for a group of m files that appear to
> be the same, requires 2(m-1) reads through the whole files if you use a
> comparison based method, or m reads if you use a strong hashing method.
> You can't hope to cut the reads off early when using comparisons,
> because the files won't be different.
If you read them in parallel, it's _at most_ m reads (m is the worst case
here), not 2(m-1). In my tests it has always been significantly less than m.
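
The idea can be sketched roughly like this (a minimal illustration, not
the actual fdups code): read all candidate files block by block in
parallel, regroup them by matching blocks, and drop a file as soon as its
block differs from every other file's. A non-duplicate then costs only as
many reads as it takes to find the first differing block. The chunk size
below is an arbitrary assumption.

```python
from collections import defaultdict

CHUNK = 8192  # assumed block size, tune as needed

def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    groups = [[open(p, "rb") for p in paths]]
    results = []
    while groups:
        next_groups = []
        for group in groups:
            # Regroup the files by the content of their next block.
            buckets = defaultdict(list)
            for f in group:
                buckets[f.read(CHUNK)].append(f)
            for chunk, files in buckets.items():
                if len(files) == 1:        # unique so far: stop reading it
                    files[0].close()
                elif chunk == b"":         # all exhausted together: duplicates
                    results.append([f.name for f in files])
                    for f in files:
                        f.close()
                else:                      # still matching: keep reading
                    next_groups.append(files)
        groups = next_groups
    return results
```

Note that this never reads past the first differing block of a unique
file, which is why the total read volume usually comes out well under m
full passes.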
-pu