[perl-python] a program to delete duplicate files

Bengt Richter bokr at oz.net
Fri Mar 11 18:11:33 EST 2005


On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein <eppstein at ics.uci.edu> wrote:

>In article <4232079b$1 at news.vo.lu>,
> Patrick Useldinger <pu.news.001 at gmail.com> wrote:
>
>> > Well, but the spec didn't say efficiency was the primary criterion, it 
>> > said minimizing the number of comparisons was.
>> 
>> That's exactly what my program does.
>
>If you're doing any comparisons at all, you're not minimizing the number 
>of comparisons.
>
ISTM a lot will depend on the distribution of differences between candidate
(equal-sized, obviously) files. If multi-GB files differ in the last byte only
(but this is not known), they will all have to be crunched to the last byte to
eliminate them from possible duplicate-set membership. If there is a priori knowledge
of highly probable difference locations (e.g., an internal date stamp at some known offset),
then obviously the candidates can be pre-classified into smaller candidate sets by some
simple heuristic probes.
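
Just to illustrate, here's a rough sketch of such a probe-based pre-classification in
Python. The offsets and window length below are made-up placeholders, not anything
format-specific; in practice they'd come from knowledge of the file format:

from collections import defaultdict

PROBE_OFFSETS = (0x40, 0x200)   # hypothetical "likely difference" offsets
PROBE_LEN = 16                  # small window read at each offset

def probe_key(path):
    # Read a few small windows at fixed offsets -- cheap compared to a full scan.
    parts = []
    f = open(path, 'rb')
    try:
        for off in PROBE_OFFSETS:
            f.seek(off)
            parts.append(f.read(PROBE_LEN))
    finally:
        f.close()
    return tuple(parts)

def preclassify(paths):
    # Split one group of equal-sized candidates into smaller groups.
    groups = defaultdict(list)
    for p in paths:
        groups[probe_key(p)].append(p)
    # Only groups that still contain more than one file need a full comparison.
    return [g for g in groups.values() if len(g) > 1]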

If differences are likely to appear early in a sequential scan, perhaps developing hashes
for all candidates in parallel, a block at a time, would be a good strategy: split the
candidate set as soon as the streams diverge. But I doubt that blindly pre-computing full
hashes would be optimal. Compare it to sort|uniq as a sieve: you wouldn't want to read any
file further than you had to.
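
To make that concrete, here is a minimal sketch of the lockstep sieve. It groups on raw
blocks rather than incremental hashes (my choice, not anything the spec requires), and
assumes 'candidates' is one group of equal-sized file names:

def find_duplicate_sets(candidates, block=65536):
    # Read all candidates in lockstep, one block at a time, splitting the
    # group whenever blocks differ; no file is read further than necessary.
    open_files = [(p, open(p, 'rb')) for p in candidates]
    pending = [open_files]
    duplicate_sets = []
    try:
        while pending:
            group = pending.pop()
            if len(group) < 2:
                continue            # a singleton can't be a duplicate
            buckets = {}
            at_eof = True
            for path, f in group:
                chunk = f.read(block)
                buckets.setdefault(chunk, []).append((path, f))
                if chunk:
                    at_eof = False
            if at_eof:
                # identical all the way to the end -- these are duplicates
                duplicate_sets.append([p for p, _ in group])
            else:
                pending.extend(buckets.values())
    finally:
        for _, f in open_files:
            f.close()
    return duplicate_sets

One obvious caveat: this keeps every candidate in a group open at once, so a very large
group would run into file-descriptor limits.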

Regards,
Bengt Richter
