how can I make this script shorter?

Tue Feb 22 05:54:48 EST 2005

Lowell Kirsh wrote:
> I have a script which I use to find all duplicates of files within a
> given directory and all its subdirectories. It seems like it's longer
> than it needs to be but I can't figure out how to shorten it. Perhaps
> there are some python features or libraries I'm not taking advantage
> of.
>
> The way it works is that it puts references to all the files in a
> dictionary with file size being the key. The dictionary can hold
> multiple values per key. Then it looks at each key and all the
> associated files (which are the same size). Then it uses filecmp to
> see if they are actually byte-for-byte copies.
>
> It's not 100% complete but it's pretty close.
>
> Lowell

To answer the question in the message subject: 1,$d

And that's not just the completely po-faced literal answer that the
question was begging for: why write something when it's already been
done? Try searching this newsgroup; there was a discussion on this very
topic only a week ago, during which the effbot provided the URL of an
existing python file duplicate detector. There seems to be a discussion
every so often ...

However, if you persist in DIY, read the discussions in this newsgroup
and search the net (people have implemented this functionality in other
languages). Think about some general principles, such as whether you
should use a hash (e.g. SHA-n, where n is a suitably large number). If
there are N files all of the same size, you have two options: (a) do
O(N**2) file comparisons, or (b) do N hash calcs followed by O(N**2)
hash comparisons. Then, depending on your
need/whim/costs-of-false-negatives/positives, you can stop there or you
can do the file comparisons on the ones which match on hashes. You do,
however need to consider that calculating the hash involves reading the
whole file, whereas comparing two files can stop when a difference is
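detected. Also, do you understand, and are you happy with, the
(default) "shallow" option of filecmp.cmp(), which treats files whose
os.stat() signatures (type, size, modification time) match as equal
without reading their contents?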
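
For what it's worth, option (b) plus the final byte-for-byte check
might be sketched roughly as below. This is not the original poster's
script, only an illustration: the names file_digest and
find_duplicates, the choice of SHA-256, and the chunk size are all
assumptions made for the example. Bucketing the digests in a dictionary
also avoids comparing hashes pairwise. As a simplification, the
verification step compares each candidate only against the first file
in its hash bucket, so in the (very unlikely) event of a hash collision
the non-matching files are dropped rather than regrouped.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, read in chunks.

    Hypothetical helper; the hash and chunk size are arbitrary choices.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root, verify=True):
    """Return sorted groups of paths under root with identical contents."""
    # Step 1: bucket regular files by size; a unique size has no duplicate.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Step 2: one hash per file; the dict groups equal digests.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        for candidates in by_hash.values():
            if len(candidates) < 2:
                continue
            if verify:
                # Step 3: confirm with a full comparison. shallow=False
                # forces filecmp to read the contents instead of
                # trusting matching os.stat() signatures.
                first = candidates[0]
                candidates = [first] + [
                    p for p in candidates[1:]
                    if filecmp.cmp(first, p, shallow=False)
                ]
            if len(candidates) > 1:
                groups.append(sorted(candidates))
    return sorted(groups)
```

Calling find_duplicates(".") would return a list of lists, each inner
list holding the paths of files found to be identical. Passing
verify=False stops after the hash step, which is the "stop there"
choice mentioned above.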