a program to delete duplicate files

John Machin sjmachin at lexicon.net
Sat Mar 12 16:18:47 EST 2005


Patrick Useldinger wrote:
> John Machin wrote:
>
> > Just look at the efficiency of processing N files of the same size
> > S, where they differ after d bytes: [If they don't differ, d = S]
> >
> > PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
> > which is important for small N and large d].
> >
> > Hashing method: O(NS) reading time, O(NS) hash calc time
>
>
> Shouldn't you add the additional comparison time that has to be done
> after hash calculation? Hashes do not give 100% guarantee. If there's
> a large number of identical hashes, you'd still need to read all of
> these files to make sure.
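
For concreteness, here is a rough sketch (not Patrick's actual code,
just an illustration of the idea) of why the direct-comparison figure
is O(Nd): same-sized files are read in parallel, block by block, and
reading stops at the first block that differs, i.e. after roughly d
bytes of each.

# Rough sketch only -- the function name and block size are mine, not
# from Patrick's program.  Read two same-sized files in lockstep and
# stop at the first differing block.
def files_identical(path_a, path_b, blocksize=64 * 1024):
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(blocksize)
            block_b = fb.read(blocksize)
            if block_a != block_b:
                return False      # they differ somewhere in this block
            if not block_a:       # both exhausted without a difference
                return True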

Maybe I was wrong: lawyers are noted for irritating precision. You
meant to say in your own defence: "If there is *any* number (n >= 2)
of identical hashes, you'd still need to *RE*-read and *compare* ...".

1. The direct-comparison method is ahead already, even before we add
to the hashing method any extra time for re-reading and comparing the
files whose hashes turn out to be equal.

2. As others have explained, with a decent hash function the
probability of a false positive is vanishingly small. Further, nobody
in their right mind [1] would contemplate automatically deleting n-1
out of a bunch of n reportedly duplicate files without further
investigation. Genuine duplicates usually sit in the same directory
under different names, or in different-but-related directories under
the same name, and/or have a plausible explanation for how they came
to be duplicated -- so a one-in-a-zillion false positive should stand
out as implausible.
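
To make point 2 concrete, here is a belt-and-braces sketch (the
function names are mine, not from any posted program): files already
known to be the same size are bucketed by SHA-256, and each bucket is
then re-read and byte-compared before anything is reported as a
duplicate, so even a freak hash collision cannot produce a false
report.

# Sketch only -- names are mine.  Assumes `paths` are files already
# grouped by identical size.
import filecmp
import hashlib

def _digest(path, blocksize=64 * 1024):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.digest()

def confirmed_duplicates(paths):
    by_hash = {}
    for path in paths:
        by_hash.setdefault(_digest(path), []).append(path)
    for group in by_hash.values():
        if len(group) < 2:
            continue
        # Re-read and compare: check every file in the bucket against
        # the first one; a hash false positive can't survive this.
        reference = group[0]
        same = [reference] + [p for p in group[1:]
                              if filecmp.cmp(reference, p, shallow=False)]
        if len(same) > 1:
            yield same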

Different subject: the maximum number of files that can be open at
once. I raised this issue with you because I had painful memories of
having to work around a limit of 20 open files on MS-DOS years ago,
and was aware that this magic number had been copied blindly from
early Unix. I did tell you that empirically I could get 509 successful
opens on Win 2000 [add 3 for stdin/out/err to get a plausible number]
-- this seems high enough to me compared to the likely number of files
of the same size -- but you might like to consider a fall-back
detection method instead of just quitting immediately if you run out
of handles.
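
Something along these lines would do for the fall-back detection --
purely a sketch, and the function name is mine: try to open the whole
same-size group, and if the OS runs out of handles, close everything
and signal the caller to drop back to a method that needs only two or
three handles at a time.

# Speculative sketch -- not from Patrick's program.  On "too many open
# files" (errno EMFILE) close what was opened and return None so the
# caller can fall back gracefully instead of quitting.
import errno

def open_group(paths):
    handles = []
    try:
        for path in paths:
            handles.append(open(path, 'rb'))
    except OSError as exc:
        for h in handles:
            h.close()
        if exc.errno == errno.EMFILE:
            return None       # caller should switch to a low-handle method
        raise
    return handles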

You wrote at some stage in this thread that (a) this caused problems on
Windows and (b) you hadn't had any such problems on Linux.

Re (a): what evidence do you have?

Re (b): famous last words! How long would it take you to do a test and
announce the margin of safety that you have?
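
The test itself is a five-minute job -- something like the throwaway
script below (using one scratch file opened repeatedly, which is just
one way to do it) reports how many handles a single process can hold
before open() fails.

# Throwaway test: keep opening one scratch file until the OS refuses,
# then report how many simultaneous handles the process managed.
import os
import tempfile

def count_successful_opens():
    fd, scratch = tempfile.mkstemp()
    os.close(fd)
    handles = []
    try:
        while True:
            try:
                handles.append(open(scratch, 'rb'))
            except OSError:
                break
        return len(handles)
    finally:
        for h in handles:
            h.close()
        os.remove(scratch)

if __name__ == '__main__':
    print("successful opens before failure:", count_successful_opens())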


[1] Yes, I am aware of the subject of this thread.

Regards,

John



