a program to delete duplicate files

Patrick Useldinger pu.news.001 at gmail.com
Sat Mar 12 16:38:42 EST 2005


John Machin wrote:

> Maybe I was wrong: lawyers are noted for irritating precision. You
> meant to say in your own defence: "If there are *any* number (n >= 2)
> of identical hashes, you'd still need to *RE*-read and *compare* ...".

Right, that is what I meant.

> 2. As others have explained, with a decent hash function, the
> probability of a false positive is vanishingly small. Further, nobody
> in their right mind [1] would contemplate automatically deleting n-1
> out of a bunch of n reportedly duplicate files without further
> investigation. Duplicate files are usually (in the same directory with
> different names or in different-but-related directories with the same
> names) and/or (have a plausible explanation for how they were
> duplicated) -- the one-in-zillion-chance false-positive should stand
> out as implausible.

Still, if you can get it 100% right automatically, why would you bother 
checking manually? Why fall back on arguments like "impossible", 
"implausible" or "can't be" when you can have a simple and correct 
answer - yes or no?
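
To illustrate what I mean by a simple and correct answer, here is a 
minimal pairwise sketch (not the fdups code itself, which works on whole 
groups of same-size files rather than pairs):

def identical(path_a, path_b, blocksize=64 * 1024):
    """True only if both files have exactly the same contents."""
    a = open(path_a, 'rb')
    b = open(path_b, 'rb')
    try:
        while True:
            block_a = a.read(blocksize)
            block_b = b.read(blocksize)
            if block_a != block_b:
                return False        # the first differing block settles it
            if not block_a:         # both files exhausted at the same point
                return True
    finally:
        a.close()
        b.close()

No probabilities involved: the answer is yes or no, full stop.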

Anyway, fdups does nothing other than report duplicates. Deleting, 
hardlinking or anything else might be an option depending on the context 
in which you use fdups, but then we'd have to discuss that context. I 
never assumed any context, in order to keep it as universal as possible.

> Different subject: maximum number of files that can be open at once. I
> raised this issue with you because I had painful memories of having to
> work around max=20 years ago on MS-DOS and was aware that this magic
> number was copied blindly from early Unix. I did tell you that
> empirically I could get 509 successful opens on Win 2000 [add 3 for
> stdin/out/err to get a plausible number] -- this seems high enough to
> me compared to the likely number of files with the same size -- but you
> might like to consider a fall-back detection method instead of just
> quitting immediately if you ran out of handles.

For the time being, the additional files are ignored and a warning is 
issued. fdups does not quit; why do you say that it does?

A fallback solution would be to open the file before every _block_ read 
and close it afterwards. In my mind, this would be a command-line option, 
because it is difficult to determine the number of available file handles 
in a multitasking environment.
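
To make this concrete, here is a rough sketch of such a reader (the 
names are made up for illustration; this is not fdups code):

class BlockReader:
    """Remember only an offset and reopen the file for every block
    read, so that no file handle stays open between reads."""

    def __init__(self, path, blocksize=64 * 1024):
        self.path = path
        self.blocksize = blocksize
        self.offset = 0

    def next_block(self):
        f = open(self.path, 'rb')
        try:
            f.seek(self.offset)
            block = f.read(self.blocksize)
        finally:
            f.close()
        self.offset += len(block)
        return block            # an empty result means end of file

The obvious cost is one open/seek/close per block, which is why I see it 
as a command-line option rather than the default behaviour.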

That is not difficult to implement, but I first wanted to refactor the 
code into a proper class that can be used from other Python programs, as 
you also asked. That is what I sent you tonight. It's not that I don't 
care about the file-handle problem; I just make changes in order of (my 
own) priority.
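
To give an idea of the kind of interface I mean (just a sketch with 
made-up names, not the actual code I sent you):

import os

class DuplicateFinder:
    """Sketch: group files by size, then confirm by exact comparison."""

    def __init__(self, directories, blocksize=64 * 1024):
        self.directories = directories
        self.blocksize = blocksize

    def _same_size_groups(self):
        groups = {}
        for top in self.directories:
            for dirpath, dirnames, filenames in os.walk(top):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    try:
                        size = os.path.getsize(path)
                    except OSError:
                        continue                # unreadable entry, skip it
                    groups.setdefault(size, []).append(path)
        return [g for g in groups.values() if len(g) > 1]

    def _identical(self, path_a, path_b):
        a = open(path_a, 'rb')
        b = open(path_b, 'rb')
        try:
            while True:
                block_a = a.read(self.blocksize)
                if block_a != b.read(self.blocksize):
                    return False
                if not block_a:
                    return True
        finally:
            a.close()
            b.close()

    def duplicates(self):
        """Yield lists of paths whose contents are byte-for-byte equal."""
        for group in self._same_size_groups():
            remaining = list(group)
            while len(remaining) > 1:
                first = remaining.pop(0)
                matches = [p for p in remaining if self._identical(first, p)]
                if matches:
                    yield [first] + matches
                    remaining = [p for p in remaining if p not in matches]

A caller would do something like

    for group in DuplicateFinder(['/some/dir']).duplicates():
        print(group)

and decide for itself what to do with each group.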

> You wrote at some stage in this thread that (a) this caused problems on
> Windows and (b) you hadn't had any such problems on Linux.
> 
> Re (a): what evidence do you have?

I've had the case myself on my girlfriend's XP box. It was certainly 
fewer than 500 files of the same length.

> Re (b): famous last words! How long would it take you to do a test and
> announce the margin of safety that you have?

Sorry, I do not understand what you mean by this.

-pu


