[perl-python] a program to delete duplicate files

Christos TZOTZIOY Georgiou tzot at sil-tec.gr
Fri Mar 11 17:48:19 EST 2005


On Fri, 11 Mar 2005 11:07:02 -0800, rumours say that David Eppstein
<eppstein at ics.uci.edu> might have written:

>More seriously, the best I can think of that doesn't use a strong slow 
>hash would be to group files by (file size, cheap hash) then compare 
>each file in a group with a representative of each distinct file found 
>among earlier files in the same group -- that leads to an average of 
>about three reads per duplicated file copy: one to hash it, and two for 
>the comparison between it and its representative (almost all of the 
>comparisons will turn out equal but you still need to check unless you 
>use a strong hash).
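
In code, the scheme David describes would look roughly like the sketch
below (the helper names and constants are mine -- this is not David's
pseudocode or Patrick's program):

import os
from collections import defaultdict

CHEAP_HASH_BYTES = 4096   # the "cheap hash" reads only the first few K

def cheap_hash(path):
    # fast first-pass discriminator: hash of the first few K only
    with open(path, 'rb') as f:
        return hash(f.read(CHEAP_HASH_BYTES))

def same_contents(path_a, path_b, blocksize=65536):
    # full byte-for-byte comparison, no strong hash needed
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a, b = fa.read(blocksize), fb.read(blocksize)
            if a != b:
                return False
            if not a:              # both files exhausted, contents equal
                return True

def find_duplicates(paths):
    # group by file size, then by cheap hash
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    duplicates = []
    for members in by_size.values():
        if len(members) < 2:
            continue
        by_cheap = defaultdict(list)
        for p in members:
            by_cheap[cheap_hash(p)].append(p)
        # within each group, compare each file against one representative
        # of every distinct content already seen
        for candidates in by_cheap.values():
            reps = []
            for p in candidates:
                for rep in reps:
                    if same_contents(rep, p):
                        duplicates.append((rep, p))
                        break
                else:
                    reps.append(p)
    return duplicates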

The code I posted in another thread (and linked to in this one) does
exactly that: a quick hash of the first few K before calculating the whole
file's md5 sum.  However, Patrick's code is faster, since it reads only what
is necessary.  He did what I had intended to do but was too lazy for; what I
actually did was rewrite from scratch one of the first programs I ever wrote
in Python, whose code was obviously too amateurish for me to publish :)

It seems your objections relate to Xah Lee's specifications; I have no
objections to your objections (-:), other than to note that we are just trying
to produce something of practical value out of an otherwise doomed thread...
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...


