A sets algorithm

Mon Feb 8 23:48:58 EST 2016

Chris Angelico wrote:
> hash_to_filename = defaultdict(list)
> for fn in files:
>     # Step 1: Hash every file.
>     hash = calculate_hash(fn)
> 
>     # Step 2: Locate all pairs of files with identical hashes
>     hash_to_filename[hash].append(fn)

I think you can avoid hashing the files altogether.

First divide the files into groups of the same size. Then for
each group, open all the files at once, read them in parallel
and compare them with each other.

As soon as you find a difference, split the group into
smaller groups. When a group gets down to just one file, you
can stop reading that file.

Assuming that most of the differing files differ near the
beginning, this strategy means that you will hardly ever
have to read a whole file. Hashing the files beforehand,
in contrast, requires reading all of every file every
time.

-- 
Greg