comparing multiple copies of terabytes of data?

Josiah Carlson jcarlson at uci.edu
Tue Oct 26 12:17:22 EDT 2004


Istvan Albert <ialbert at mailblocks.com> wrote:
> 
> Josiah Carlson wrote:
> 
> > The code to do so is simple:
> 
> ...
>  >     p = -1
>  >     good = 1
>  >     while f1.tell() < p:
>  >         p = f1.tell()
>  >         if f1.read(b) == f2.read(b) == f3.read(b):
>  >             continue
> 
> ...
> 
> What is slightly amusing is that your *simple*
> solution is actually incorrect. You got the
> comparison backwards in the while loop.

My finger slipped.
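
For the record, the condition should have been the other way around; with the same f1/f2/f3 handles and block size b as in that post, something like:

    p = -1
    good = 1
    while f1.tell() > p:    # '>' rather than '<': keep reading while the position advances
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
        ...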


> Other functional deficiency when compared to
> the cmp diffs is that you don't know which
> file has changed or which byte differs
> ...  adding that brings about the potential
> for another set of bugs. Then someone else comes along
> who knows a little less about python and adds
> a little feature to the program that actually
> silently breaks it ...

You are now talking about code maintenance, which is a separate issue
from "how fast can I compare three files", the question that was
originally asked.


> Whether or not it is actually faster remains
> to be seen. And that was my whole point:
> don't dismiss cmp too soon; see how it works,
> test it, and then, armed with some real numbers,
> one can make better decisions.

Ok.  Let me see.  I am using the fastest computer in the apartment: my
wife's 2.26 GHz P4 without hyperthreading, with a gig of memory, running
two 40 GB ATA-100 drives.  I have created three 1.5 GB files on one of the
drives (I don't have 4.5 TB free), which should give us a reasonable
measure, if shy by around 3 orders of magnitude in terms of time.

cmp looks to be taking 0-12% of the processor, so I was correct in my
statement that it is likely disk bound (unless your disks are 8 times
faster or your processor is 1/10 as fast).

Oh, goodness, it is still going.  I've been typing this email for over 8
minutes.  It sure would be nice if cmp had a progress bar.  At least
then I could know if I should kill it now or wait another few minutes.
To hell with it, I'm going to try the Python version.

*make a little fix to the code to give me a progress bar, time elapsed,
and an expected total time*
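
The tweak amounted to something like this (a sketch rather than the exact
code I ran; the function name compare_3 and the hard-coded 1.5 GB total
are just for this test):

import time

def compare_3(fn1, fn2, fn3, total=1.5*2**30):
    # block-by-block comparison of three files, printing rough progress
    # and an expected total time every 8 blocks
    f1, f2, f3 = [open(fn, 'rb') for fn in (fn1, fn2, fn3)]
    b = 2**20                  # block size; tune as necessary
    st = time.time()
    good = 1
    p = -1
    while f1.tell() > p:
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            if not ((p>>20)&7) and p:
                a = 100.0*p/total          # percent done
                d = time.time()-st         # seconds elapsed so far
                print "%8.1f%%  %6.0fs elapsed  ~%6.0fs total\t\r"%(a,d,100*d/a),
            continue
        good = 0
        break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    else:
        print "files differ before", f1.tell()
    f1.close(); f2.close(); f3.close()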

cmp: at least 8 1/2 minutes for the first cmp; it was not finished when I
killed it, so at least 17 minutes in total for the two cmp runs needed.
Python: 15 minutes, 3 seconds to finish it all.

Hrm, that is only around 5 megs/second.  I think we can do better (even
if it is already better than cmp)...
New algorithm...


import md5, time   # md5 was the standard hash module in Python 2.3/2.4

def compare_32(fn1, fn2, fn3):
    mn = md5.new
    st = time.time()
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20            # block size; tune this as necessary
    good = 1
    total = 1.5*2**30    # size of each test file, used only for the progress estimate
    digs = []
    # pass 1: read file 1 sequentially, recording a cumulative digest after each block
    m = mn()
    p = -1
    while f1.tell() > p:
        p = f1.tell()
        m.update(f1.read(b))     # update() returns None, so take the digest separately
        digs.append(m.digest())
        if not ((p>>20)&7) and p:
            a = 100.0*p/total/3              # percent of all three files' bytes read
            d = time.time()-st
            print "%8.1f%%  %8.0f\t\r"%(a,100*d/a),   # percent done, expected total seconds
    # pass 2: read file 2 sequentially, comparing cumulative digests against file 1's
    m = mn()
    for dig in digs:
        m.update(f2.read(b))
        if dig != m.digest():
            print "files 1 and 2 differ before", f2.tell()
            good = 0
            break
    # pass 3: same again for file 3
    m = mn()
    for dig in digs:
        m.update(f3.read(b))
        if dig != m.digest():
            print "files 1 and 3 differ before", f3.tell()
            good = 0
            break
    # finally, make sure no file is longer than file 1
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close() #I prefer to explicitly close my file handles
    f2.close()
    f3.close()
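
To run it against three copies, the call is just the following (the paths
here are made up; point it at whatever three files you are checking):

compare_32('/data/copy1.bin', '/data/copy2.bin', '/data/copy3.bin')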


That new one runs in 5 minutes 15 seconds total, because it exploits the
fact that sequential reads are fast.  It does use ~20% of the processor to
compute the md5s, which would only make a difference if your processor
were only as fast as a 400 MHz P4.

I'd say that Python wins here.  Is that concrete enough for you?


 - Josiah



