sorting with expensive compares?

Steven D'Aprano steve at REMOVETHIScyber.com.au
Fri Dec 23 15:48:23 EST 2005


On Fri, 23 Dec 2005 09:20:55 -0800, bonono wrote:

> 
> Dan Stromberg wrote:

[snip]

>> I've been using the following compare function, which in short checks, in
>> order:
>>
>> 1) device number
>> 2) inode number
>> 3) file length
>> 4) the beginning of the file
>> 5) an md5 hash of the entire file
>> 6) the entire file
>>
>> (If #1 and #2 are identical, then the file must be a hardlink to the other
>> file.  Also, files that do not have the same length can never be
>> identical.  And of course these items are generated on demand, making it
>> frequently possible to avoid doing full-file-length compare on a lot of
>> files)
>>
>> However, my program is still doing more #6 comparisons than seems
>> strictly necessary when I could just toss all the filenames describing
>> identical files into a list, and avoid re-comparing files with identical
>> content over and over (I don't want to compare them to each other again
>> and again) when there are a lot of identical files in the input list.
>
> Why would #5 not be enough as an indicator that the files are identical?

(1) Because the MD5 algorithm does produce collisions. I was going to say
"rare collisions", but there is actually an infinite number of them. The
collisions are just distributed thinly -- because MD5 is a good hash
algorithm, *very* thinly.

(Proof of this comes from the pigeon-hole principle: there are infinitely
many possible files of arbitrary length, but only a finite number of
possible hashes, so some hash values must be shared by infinitely many
different files.)
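
(As a quick illustration -- just a minimal snippet using the standard
hashlib module, not anything from the original poster's program -- every
MD5 digest has the same fixed size no matter how large the input is:

    import hashlib

    # Every MD5 digest is exactly 16 bytes (128 bits), however large the
    # input, so the infinite set of possible files is squeezed into a
    # finite set of hashes.
    for data in (b"", b"spam", b"x" * 10**7):
        print(len(data), len(hashlib.md5(data).digest()))  # always ... 16
)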

(2) Having said that, unless the original poster is dealing with billions
(plus) of files, it is unlikely that he is hitting any of those collisions;
a bug in his sort routine is far more likely. Since he claims to be doing
more comparisons-by-file-contents than expected (I would suggest *one* is
more than expected), I predict a bug in his code, his algorithm, or both.
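
For what it's worth, here is a rough sketch of the kind of grouping I would
expect. The function name and structure are my own invention, not the
original poster's code; it buckets filenames by size first and then by MD5
digest, so each file's contents are read at most once:

    import hashlib
    import os

    def group_duplicates(filenames):
        """Return lists of filenames whose contents are (almost
        certainly) identical."""
        # Bucket by size first: files of different lengths can never
        # be identical.
        by_size = {}
        for name in filenames:
            by_size.setdefault(os.path.getsize(name), []).append(name)

        # Within each size bucket, hash each file's contents once and
        # group the names by digest.
        groups = {}
        for size, names in by_size.items():
            if len(names) < 2:
                continue    # a unique size needs no further checking
            for name in names:
                f = open(name, "rb")
                try:
                    digest = hashlib.md5(f.read()).hexdigest()
                finally:
                    f.close()
                groups.setdefault((size, digest), []).append(name)

        return [names for names in groups.values() if len(names) > 1]

With something along those lines the number of full-file reads grows
linearly with the number of files, rather than with the number of pairs of
files, which is the behaviour the original poster seems to be after.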


-- 
Steven.



