how can I make this script shorter?

Lowell Kirsh lkirsh at cs.ubc.ca
Wed Feb 23 04:57:05 EST 2005


Thanks for the advice. There are definitely some performance issues I 
hadn't thought of before. I guess it's time to go lengthen, not shorten, 
the script.

Lowell

John Machin wrote:
> Lowell Kirsh wrote:
> 
>>I have a script which I use to find all duplicates of files within a
>>given directory and all its subdirectories. It seems like it's longer
>>than it needs to be but I can't figure out how to shorten it. Perhaps
>>there are some python features or libraries I'm not taking advantage of.
>>
>>The way it works is that it puts references to all the files in a
>>dictionary with file size being the key. The dictionary can hold
>>multiple values per key. Then it looks at each key and all the
>>associated files (which are the same size). Then it uses filecmp to see
>>if they are actually byte-for-byte copies.
>>
>>It's not 100% complete but it's pretty close.
>>
>>Lowell
> 
> 
> To answer the question in the message subject: 1,$d
> 
> And that's not just the completely po-faced literal answer that the
> question was begging for: why write something when it's already been
> done? Try searching this newsgroup; there was a discussion on this very
> topic only a week ago, during which the effbot provided the URL of an
> existing python file duplicate detector. There seems to be a discussion
> every so often ...
> 
> However if you persist in DIY, read the discussions in this newsgroup,
> search the net (people have implemented this functionality in other
> languages); think about some general principles -- like should you use
> a hash (e.g. SHA-n where n is a suitably large number). If there are N
> files all of the same size, you have two options: (a) do O(N**2) file
> comparisons, or (b) do N hash calcs followed by O(N**2) hash
> comparisons; then, depending on your
> need/whim/costs-of-false-negatives/positives, you can stop there or you
> can do the file comparisons on the ones which match on hashes. You do
> however need to consider that calculating the hash involves reading the
> whole file, whereas comparing two files can stop when a difference is
> detected. Also, do you understand and are you happy with using the
> (default) "shallow" option of filecmp.cmp()?
> 
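The pipeline discussed above -- group by size, then by hash, then confirm
byte-for-byte -- might be sketched like this in modern Python (the function
name and the choice of SHA-1 are my own; note that filecmp.cmp is called
with shallow=False so it compares contents rather than just os.stat info):

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Return lists of paths believed to be byte-for-byte identical."""
    # 1. Group files by size; only same-size files can be duplicates.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # 2. Within a size group, group again by digest: N hash
        #    calculations instead of O(N**2) file comparisons. The cost
        #    is that hashing always reads each file in full, whereas a
        #    direct comparison can stop at the first differing byte.
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha1()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
        # 3. Optionally confirm hash matches byte-for-byte, guarding
        #    against hash collisions. shallow=False forces a real
        #    content comparison, not just an os.stat() check.
        for group in by_hash.values():
            confirmed = [group[0]]
            for other in group[1:]:
                if filecmp.cmp(group[0], other, shallow=False):
                    confirmed.append(other)
            if len(confirmed) > 1:
                duplicates.append(confirmed)
    return duplicates
```

Whether step 3 is worth keeping is exactly the need/whim/cost-of-false-positives
trade-off raised above; for most purposes a full-file hash match is already
convincing.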



More information about the Python-list mailing list