A sets algorithm

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Feb 8 23:13:07 EST 2016


On Tuesday 09 February 2016 02:11, Chris Angelico wrote:

> That's fine for comparing one file against one other. He started out
> by saying he already had a way to compare files for equality. What he
> wants is a way to capitalize on that to find all the identical files
> in a group. A naive approach would simply compare every file against
> every other, for O(N*N) comparisons - but a hash lookup can make that
> O(N) on the files themselves, plus (I think) an O(N log N) hash
> comparison job, which has much lower constant factors.

You're mixing up N's there :-) In the first two (compare every file against 
every other file, and hash lookup), N stands for the number of files. But in 
the hash comparison, well, I'm not sure what you mean by that, unless you 
mean the calculation of the hash itself, which will be O(M) where M is the 
size of the file.

Unfortunately, even using a hash, you still have to do O(N**2) work, or 
rather, O(N*M) where N is the number of files and M is their size.
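For example, hashing a file means reading and digesting every byte of it,
something like this (an untested sketch using hashlib; sha1 is just a
placeholder choice, and the function takes a path):

import hashlib

def calculate_hash(path, chunksize=2**20):
    # Reading and digesting every byte is what makes this O(M) in the
    # size of the file, no matter how fast the digest itself is.
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunksize), b''):
            digest.update(chunk)
    return digest.hexdigest()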


Assuming that duplicate files are relatively rare, the best way to do this 
is to collate files with the same size. Looking up the file size ought to be 
pretty fast. It is an IO operation, but it is approximately constant time in 
the sense that it doesn't actually depend on the size of the file.

(If you're using a programming language or operating system where the only 
way to find out how big a file is to open it and count the bytes, this 
obviously won't work for you.)
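In Python, for example, it's a single stat call, something like this
(assuming you have the file's path):

import os

size = os.path.getsize(path)   # one stat() call; the file is never read
# or equivalently:
size = os.stat(path).st_size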


So let's collect files into a separate bucket for each unique file size:

bysize = {}
for file in files:
    # Assumes each entry in `files` is an object that knows its own size,
    # as in the wrapper class sketched further down.
    bysize.setdefault(file.getsize(), []).append(file)



In practice, you may find that nearly all files have different sizes, and 
most of the buckets have only a single file in them. Those files you can 
just ignore, as they must be unique.
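Concretely, pruning the singletons is a one-liner (building on the bysize 
dict above):

# Keep only sizes shared by two or more files; everything else is unique.
candidates = {size: group for size, group in bysize.items()
              if len(group) > 1}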

So instead of having to compare N files against each of the others, you are 
left with a much smaller number of buckets, each with (hopefully) only two 
or three files of the same size. You might have gone from N=1000000 to 
thirty buckets with five files each.

And most of those files will probably differ on the very first byte, or at 
least within the first 16K of bytes, so it's hardly worth hashing them 
(especially if the hash algorithm is an expensive cryptographic hash, or if 
the file is large).

I would delay the hash comparison as long as possible, something like this 
(methods on a wrapper class around each file, which caches its hash in 
self._hash, initially None):


def __hash__(self):
    if self._hash is None:
        # Expensive (it reads the whole file), so we cache it.
        self._hash = self.calculate_hash()
    return self._hash


def __eq__(self, other):
    size = self.getsize()
    if size != other.getsize():
        return False
    if self._hash is not None and other._hash is not None:
        # Both hashes are already done, so might as well use them.
        return (
            hash(self) == hash(other)
            # in case of collisions...
            and self.read() == other.read()
            )
    if size < MAXSIZE:
        # Small file: cheaper to just compare the contents directly.
        return self.read(size) == other.read(size)
    else:
        # Large file: compare the first MAXSIZE bytes before paying
        # for a full hash of either file.
        if self.read(MAXSIZE) != other.read(MAXSIZE):
            return False
        return (
            hash(self) == hash(other)
            # in case of collisions...
            and self.read() == other.read()
            )


If your hash is strong enough that collisions are vanishingly rare, you 
could ignore the `and self.read() == other.read()` check.
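After that, each candidate bucket only needs a handful of pairwise 
comparisons (untested sketch, assuming the buckets hold instances of the 
wrapper class with the methods above):

from itertools import combinations

duplicates = []
for group in candidates.values():
    # Buckets are tiny, so comparing each pair directly is cheap; the
    # __eq__ above escalates from size to a prefix check to the hash.
    for a, b in combinations(group, 2):
        if a == b:
            duplicates.append((a, b))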





-- 
Steve



