comparing multiple copies of terabytes of data?
Josiah Carlson
jcarlson at uci.edu
Mon Oct 25 16:01:20 EDT 2004
Istvan Albert <ialbert at mailblocks.com> wrote:
>
> Dan Stromberg wrote:
>
> > Rather than cmp'ing twice, to verify data integrity, I was thinking we
> > could speed up the comparison a bit, by using a python script that does 3
>
> Use the cmp. So what if you must run it twice ... by the way I
> really doubt that you could speed up the process in python
> ... you'll probably end up with a much slower version
In this case you would be wrong. Comparing data on the processor is
trivial (and is done in Python's C internals anyway if strict string
equality is all that matters), but IO is expensive. Reading terabytes
of data is going to be the bottleneck, so reducing IO is /the/
optimization that can and should be done.
The code to do so is simple:
def compare_3(fn1, fn2, fn3):
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20  #block size; tune this as necessary
    p = -1
    good = 1
    while f1.tell() > p:  #stop once a pass consumes no more data
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
        print "files differ"
        good = 0
        break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close() #I prefer to explicitly close my file handles
    f2.close()
    f3.close()
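To see the function in action, here is a self-contained sketch that rewrites it in modern Python 3 syntax (print() calls, b'' for EOF, and a return value I added for the demonstration) and runs it against three scratch copies of the same payload; the temp-file setup is purely illustrative:

```python
import os
import tempfile

def compare_3(fn1, fn2, fn3):
    # Single-pass, three-way comparison: one read of each file
    # instead of two pairwise cmp runs (two reads of the first file).
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20  # block size; tune as necessary
    p = -1
    good = True
    while f1.tell() > p:  # stop once a pass consumes no more data
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
        print("files differ")
        good = False
        break
    if good:
        # belt-and-braces check that all three are at EOF
        good = f1.read(1) == f2.read(1) == f3.read(1) == b''
        if good:
            print("files are identical")
    f1.close()
    f2.close()
    f3.close()
    return good

# Demonstration on three scratch copies of the same payload:
payload = b'x' * (2**16)
paths = []
for _ in range(3):
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as f:
        f.write(payload)
    paths.append(path)
same = compare_3(*paths)
for path in paths:
    os.remove(path)
```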
Note that it /may/ be faster to first convert the data into arrays
(module array) to get 2, 4 or 8 byte block comparisons.
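A minimal sketch of that idea (Python 3's array API; the 8-byte 'Q' typecode and the requirement that each block's length be a multiple of 8 are my assumptions, not part of the original suggestion):

```python
from array import array

def blocks_equal_as_arrays(b1, b2, b3):
    # Reinterpret each byte block as an array of 8-byte unsigned
    # integers before comparing. frombytes() raises ValueError if a
    # block's length is not a multiple of the 8-byte item size.
    a1, a2, a3 = array('Q'), array('Q'), array('Q')
    a1.frombytes(b1)
    a2.frombytes(b2)
    a3.frombytes(b3)
    return a1 == a2 == a3
```

Note that in CPython a plain bytes comparison is already a memcmp in C, so measure before assuming the array form actually wins.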
- Josiah