comparing multiple copies of terabytes of data?

Fri Oct 29 00:46:35 EDT 2004

Dan Stromberg wrote:
> that I hope to use correctly the first time, in verifying that our three
> copies of a 3+ terrabyte collection of data...  well, that the first copy

Not trying to be too much of a pedant, but the prefix is "tera"
not "terra".

> The URL is http://dcs.nac.uci.edu/~strombrg/verify.html

To make up for that pedantry, here are some pointers.

	for i in range(len(dirlist)):
		fullfilename = os.path.join(dirlist[i],filename)
		os.chdir(dirlist[i])
		statbufs.append(os.lstat(fullfilename))

could be rewritten as

   for dirname in dirlist:
     fullfilename = os.path.join(dirname, filename)
     os.chdir(dirname)
     statbufs.append(os.lstat(fullfilename))

BTW, do you really need that chdir?  You've already got the full
filename so the current directory doesn't make a difference.  Unless
you allow relative directories there ... ?

   for i in range(len(dirlist)-1):
     for field in [stat.ST_GID, stat.ST_UID, stat.ST_MODE, stat.ST_SIZE]:
       if statbufs[i][field] != statbufs[i+1][field]:
         return 0

Can dirlist ever be empty?

You just want to check if all elements in statbufs are the
same, right?  I prefer this style

   first = statbufs[0]
   for statbuf in statbufs[1:]:
     for field in (stat.ST_GID, stat.ST_UID, stat.ST_MODE, stat.ST_SIZE):
       if first[field] != statbuf[field]:
         return 0

because I find the chained comparison of i and i+1 harder
to understand, even if this version is one line longer.
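
For what it's worth, here is a minimal sketch of that comparison as a
self-contained function, written in present-day Python 3. The names
same_metadata and FIELDS are mine, not from your script, and treating
an empty dirlist as "all the same" is only an assumption about what
you would want there. It also skips the chdir entirely.

```python
import os
import stat

# The same four fields the original check compares.
FIELDS = (stat.ST_GID, stat.ST_UID, stat.ST_MODE, stat.ST_SIZE)

def same_metadata(dirlist, filename):
    """Return True if filename has matching metadata in every directory."""
    statbufs = [os.lstat(os.path.join(dirname, filename))
                for dirname in dirlist]
    if not statbufs:
        # Assumption: with nothing to compare, report "same".
        return True
    first = statbufs[0]
    for statbuf in statbufs[1:]:
        for field in FIELDS:
            if first[field] != statbuf[field]:
                return False
    return True
```

Comparing everything against the first entry gives the same answer as
comparing neighbours, since equality is transitive.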

def main():
   try:
     (list,dirs) = getopt.getopt(sys.argv[1:],'n:')
   except:
     usage()
   if len(dirs) == 0:
     usage()

Have you tried to see if your usage works when you give it
the wrong parameters?  It will print out the error message
then give a NameError because dirs doesn't exist.  Try this
instead

def main():
   try:
     (list,dirs) = getopt.getopt(sys.argv[1:],'n:')
   except getopt.error, err:
     usage()
     raise SystemExit(str(err))

This will only catch getopt errors, and it raises
SystemExit so that 1) it prints out the text of the
getopt error (helpful for users to know what went wrong)
and 2) it sets the program's exit code to 1.
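
A rough sketch of how the whole argument check might look, in
present-day Python 3 spelling (getopt.GetoptError is the same class
as getopt.error). The parse_args name, the usage text, and the
"no directories given" message are placeholders of mine, not taken
from your script.

```python
import getopt
import sys

def usage():
    # Hypothetical stand-in for the script's own usage() function.
    sys.stderr.write("usage: verify [-n value] directory ...\n")

def parse_args(args):
    """Parse the options; exit with a message on bad input."""
    try:
        opts, dirs = getopt.getopt(args, "n:")
    except getopt.GetoptError as err:
        usage()
        # A string argument is printed to stderr and the exit code is 1.
        raise SystemExit(str(err))
    if not dirs:
        usage()
        raise SystemExit("no directories given")
    return opts, dirs
```

Because dirs is only used after getopt has succeeded, the NameError
cannot happen.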

	for dir in dirs:
		os.chdir(dir)
	os.chdir(dirs[0])

Again, I don't understand the chdir calls here.  Perhaps
I should have read the rest of this thread?

If you want to check that each directory exists, and want
to do it in this style, you should return to the original
directory after each chdir.  Why?  Because otherwise the loop
will fail if someone specifies relative directories on the
command line: each one gets resolved relative to the previous one.

	origdir = os.getcwd()
	for dir in dirs:
		os.chdir(dir)
		os.chdir(origdir)
	os.chdir(dirs[0])

I still find it suspicious.  I almost never use chdir
in my code because it can really confuse libraries
that expect relative paths to always work during the
lifetime of a program.
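
If the loop is only there to check that the directories exist (that
is my guess, I may be wrong), something like this sketch does it
without changing the current directory at all. The function name
missing_directories is made up for illustration. Note that unlike
chdir it does not tell you whether you have permission to enter the
directory.

```python
import os

def missing_directories(dirs):
    """Return the entries of dirs that are not existing directories."""
    return [d for d in dirs if not os.path.isdir(d)]
```

Relative names are all resolved against the same starting directory,
so the order in which they are given no longer matters.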

main()

That should be

if __name__ == "__main__":
     main()

because it lets you import the file as well as use it
as a mainline program.  Why is that useful?  It makes
regression tests a lot easier to write.  To help with that
I also do

if __name__ == "__main__":
     main(sys.argv)

then make it so that the main function takes an argv.

def main(argv):
   ...

Again, it makes it easier to write tests that way.
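
As a small illustration of what that buys you, here is a sketch of a
main(argv) plus a test that calls it directly. Having main return
the parsed values is my own simplification so the test has something
to inspect, and the "verify" program name and the paths are made up.

```python
import getopt

def main(argv):
    # argv[0] is the program name, just as in sys.argv.
    opts, dirs = getopt.getopt(argv[1:], "n:")
    # Hypothetical: hand the parsed values back so a test can see them.
    return opts, dirs

def test_main_parses_directories():
    # No subprocess and no patching of sys.argv needed.
    opts, dirs = main(["verify", "-n", "2", "/a", "/b"])
    assert opts == [("-n", "2")]
    assert dirs == ["/a", "/b"]
```

The test passes in a plain list, which is about as simple as a
regression test can get.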

Hmmm, or is this response also showing me to be a pedant?

:)

Best wishes,

				Andrew
				dalke at dalkescientific.com