A sets algorithm

Cem Karan cfkaran2 at gmail.com
Sun Feb 7 20:07:40 EST 2016


On Feb 7, 2016, at 4:46 PM, Paulo da Silva <p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:

> Hello!
> 
> This may not be a strict python question, but ...
> 
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile objects for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
> 
> Thanks for any suggestions.

If you're after strict equality (every byte in a pair of files is identical), then here are a few heuristics that may help you:

1) Test for file length, without reading in the whole file.  You can use os.path.getsize() to do this (I hope that this is a constant-time operation, but I haven't tested it).  As Oscar Benjamin suggested, you can create a defaultdict(list), which will make it possible to gather lists of files of equal size.  This should help you gather your potentially identical files quickly.  (A sketch that combines this with the hashing in 2a appears after the list in point 2.)

2) Once you have your dictionary from above, you can iterate over its values, each of which will be a list.  If a list has only one file in it, you know it's unique, and you don't have to do any more work on it.  If there are two or more files in the list, then you have several different options:
	a) Use Chris Angelico's suggestion and hash each of the files (use the standard library's 'hashlib' for this).  Identical files will always have identical hashes, but there may be false positives, so you'll need to verify that files that have identical hashes are indeed identical (a sketch of this appears after this list).
	b) If your files tend to have sections that are very different (e.g., the first 32 bytes tend to be different), then you can treat that section of the file as its hash.  You can then do the same trick as above.  (The advantage of this is that you will read in a lot less data than if you have to hash the entire file.)
	c) You may be able to do something clever by reading portions of each file.  That is, use zip() combined with read(1024) to read each of the files in sections, while keeping running hashes of the files.  Or, maybe you'll be able to read portions of them and sort the list as you're reading.  In either case, if any files are NOT identical, then you'll be able to stop work as soon as you discover this, rather than having to read the files in their entirety (a sketch of this incremental approach follows the closing paragraph below).
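Putting (1) and (2a) together, here's a minimal sketch.  It works on plain file paths rather than your MyFile class, and it uses SHA-256 from hashlib as the hash; both of those choices are assumptions on my part, so adapt as needed:

    import hashlib
    import os
    from collections import defaultdict

    def group_by_size(paths):
        """Map file size -> list of paths with that size (step 1)."""
        by_size = defaultdict(list)
        for path in paths:
            by_size[os.path.getsize(path)].append(path)
        return by_size

    def sha256_of(path, chunk_size=1024 * 1024):
        """Hash a file in chunks so large files never sit in memory all at once."""
        digest = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                digest.update(chunk)
        return digest.hexdigest()

    def candidate_duplicate_sets(paths):
        """Return lists of paths sharing both size and SHA-256 digest (step 2a)."""
        sets_of_equals = []
        for same_size in group_by_size(paths).values():
            if len(same_size) < 2:
                continue  # a unique size means a unique file, no hashing needed
            by_hash = defaultdict(list)
            for path in same_size:
                by_hash[sha256_of(path)].append(path)
            sets_of_equals.extend(group for group in by_hash.values()
                                  if len(group) > 1)
        return sets_of_equals

A SHA-256 collision between different files is astronomically unlikely, but if you want certainty you can finish by comparing the members of each group byte-for-byte, e.g. with filecmp.cmp(a, b, shallow=False).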

The main purpose of these suggestions is to reduce the amount of reading you're doing.  Storage tends to be slow, and any tricks that reduce the number of bytes you need to read in will be helpful to you.
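To illustrate option (c) above, here's a minimal sketch of the incremental idea: compare a group of same-size files chunk by chunk and split the group the moment their contents diverge, so files that differ early cost only a read or two.  Again, this assumes plain paths rather than your MyFile class:

    from collections import defaultdict

    def split_equal_files(paths, chunk_size=64 * 1024):
        """Yield lists of byte-for-byte identical files from same-size 'paths'."""
        handles = {path: open(path, 'rb') for path in paths}
        try:
            groups = [list(paths)]
            while groups:
                survivors = []
                for group in groups:
                    by_chunk = defaultdict(list)
                    for path in group:
                        by_chunk[handles[path].read(chunk_size)].append(path)
                    for chunk, same_so_far in by_chunk.items():
                        if len(same_so_far) < 2:
                            continue           # no possible twin left; stop reading it
                        if chunk == b'':
                            yield same_so_far  # hit EOF together, so they're identical
                        else:
                            survivors.append(same_so_far)
                groups = survivors
        finally:
            for f in handles.values():
                f.close()

The catch is that it holds one open file handle per file in the group, so a huge group of same-size files could bump into the OS limit on open files; the hashing approach in (a) doesn't have that problem.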

Good luck!
Cem Karan

