comparing multiple copies of terrabytes of data?
Josiah Carlson
jcarlson at uci.edu
Tue Oct 26 12:17:22 EDT 2004
Istvan Albert <ialbert at mailblocks.com> wrote:
>
> Josiah Carlson wrote:
>
> > The code to do so is simple:
>
> ...
> > p = -1
> > good = 1
> > while f1.tell() < p:
> > p = f1.tell()
> > if f1.read(b) == f2.read(b) == f3.read(b):
> > continue
>
> ...
>
> What is slightly amusing is that your *simple*
> solution is actually incorrect. You got the
> comparison backwards in the while loop.
My finger slipped.
> Other functional deficiency when compared to
> the cmp diffs is that you don't know which
> file has changed or which byte differs
> ... adding that brings about the potential
> for another set of bugs. Then someone else comes along
> who knows a little less about python and adds
> a little feature to the program that actually
> silently breaks it ...
You are now talking about code maintenance, which is a separate issue
from "how fast can I compare three files", which was originally asked
about.
> Whether or not it is actually faster remains
> to be seen. And that was my whole point,
> not to don't dismiss cmp to soon, see how it works
> test it, then armed with some real numbers one
> can make better decisions.
Ok. Let me see. I am using the fastest computer in the apartment; my
wife's 2.26 ghz P4 without hyperthreading, with a gig of memory, running
two 40GB ATA-100 drives. I have created 3x1.5gb files on one of the
drives (I don't have 4.5tb free), which should give us a reasonable
measure, if shy by around 3 orders of magnitude in terms of time.
cmp looks to be taking 0-12% processor, so I was correct in my statement
that it is likely disk bound (unless your disks are 8 times faster or
your processor is 1/10 as fast).
Oh, goodness, it is still going. I've been typing this email for over 8
minutes. It sure would be nice if cmp had a progress bar. At least
then I could know if I should kill it now or wait another few minutes.
To hell with it, I'm going to try the Python version.
*make little fix to code to give me a progress bar, time elapsed, and an
expected total time*
cmp: at least 8 1/2 minutes for the first cmp, was not finished, I
killed it. At least 17 minutes in total;
Python: 15 minutes, 3 seconds to finish it all.
Hrm, that is only around 5 megs/second. I think we can do better (even
if it is already better than cmp)...
New algorithm...
def compare_32(fn1, fn2, fn3):
mn = md5.new
st = time.time()
f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
b = 2**20 #tune this as necessary
p = -1
good = 1
total = 1.5*2**30
digs = []
m = mn()
while f1.tell() > p:
p = f1.tell()
digs.append(m.update(f1.read(b)).digest())
if not ((p>>20)&7) and p:
a = 100.0*p/total/3
d = time.time()-st
print "%8.1f%% %8.0f\t\r"%(a,100*d/a),
m = mn()
for dig in digs:
if dig != m.update(f2.read(b)).digest():
print "files 1 and 2 differ before", f2.tell()
good = 0
break
m = mn()
for dig in digs:
if dig != m.update(f3.read(b)).digest():
print "files 1 and 2 differ before", f3.tell()
good = 0
break
if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
print "files are identical"
f1.close() #I prefer to explicitly close my file handles
f2.close()
f3.close()
That new one runs in 5 minutes 15 seconds total, because it exploits the
fact that sequential reads are fast. It does use ~20% processor to
compute the md5s, which only makes a difference if your processor is as
fast as a 400mhz P4.
I'd say that Python wins here. Is that concrete enough for you?
- Josiah
More information about the Python-list
mailing list