comparing multiple copies of terabytes of data?

Josiah Carlson jcarlson at uci.edu
Mon Oct 25 16:01:20 EDT 2004


Istvan Albert <ialbert at mailblocks.com> wrote:
> 
> Dan Stromberg wrote:
> 
> > Rather than cmp'ing twice, to verify data integrity, I was thinking we
> > could speed up the comparison a bit, by using a python script that does 3
> 
> Use the cmp. So what if you must run it twice ... by the way I
> really doubt that you could speed up the process in python
> ... you'll probably end up with a much slower version

In this case you would be wrong.  Comparing data on a processor is
trivial (and is done in Python's C internals anyway when strict string
equality is all that matters), but IO is expensive.  Reading terabytes
of data is going to be the bottleneck, so reducing IO is /the/
optimization that can and should be done.  Running cmp twice reads at
least one of the copies twice; a three-way comparison reads each copy
exactly once.

The code to do so is simple:

def compare_3(fn1, fn2, fn3):
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20 # block size; tune this as necessary
    p = -1
    good = 1
    while f1.tell() > p: # loop until f1 stops advancing, i.e. hits EOF
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
        print "files differ"
        good = 0
        break
    # f2 or f3 may extend past f1's EOF; a final 1-byte read catches that
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close() # I prefer to explicitly close my file handles
    f2.close()
    f3.close()
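
Usage is then just a call with the three copies (the paths here are
hypothetical):

compare_3('/mnt/a/dump.img', '/mnt/b/dump.img', '/mnt/c/dump.img')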


Note that it /may/ be faster to first convert the data into arrays
(module array) to get 2-, 4-, or 8-byte block comparisons.
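
For what it's worth, a minimal sketch of that idea, with a hypothetical
chunks_equal helper operating on blocks already read as in compare_3
above (untested; the block length must be a multiple of the item size,
which 2**20 is for every standard typecode):

import array

def chunks_equal(c1, c2, c3, typecode='L'):
    # 'L' items are 4 or 8 bytes depending on the platform; equal
    # arrays of the same typecode imply equal underlying bytes
    a1 = array.array(typecode, c1)
    a2 = array.array(typecode, c2)
    a3 = array.array(typecode, c3)
    return a1 == a2 == a3

Whether this actually beats the plain string comparison is worth
measuring before relying on it.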

 - Josiah



