os.walk/list

Dan Stromberg drsalists at gmail.com
Sat Mar 19 22:19:49 EDT 2011


You're not really supposed to call into the md5 module directly anymore (it
has been deprecated since Python 2.5); use hashlib instead.
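
For what it's worth, here is a minimal sketch of the hashlib way of doing
it. The function name file_md5 and the 64 KiB chunk size are just my
picks for illustration; reading in chunks avoids pulling a whole large
file into memory the way f1.read() does.

```python
import hashlib


def file_md5(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of the file at path (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files then match (as far as md5 can tell) when file_md5(src) ==
file_md5(dst).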

But actually, a cryptographic hash doesn't really help when comparing just
one pair of files.  A block-by-block comparison is more certain, and the
I/O time is roughly the same; it's actually less when the files differ,
because the comparison can stop at the first mismatched block.  Comparing
unbroken cryptographic hashes gives near certainty (and md5 is broken),
whereas a direct file comparison gives certainty.  So a file comparison
gives slightly greater certainty in less time.
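
A block-by-block comparison could look roughly like this.  It's only a
sketch: files_equal and the chunk size are names and values I made up,
and it assumes both paths are regular files.

```python
import os


def files_equal(path_a, path_b, chunk_size=64 * 1024):
    """Return True if both files have identical content (hypothetical helper)."""
    # Different sizes mean different content; no need to read anything.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False  # stop at the first differing block
            if not block_a:
                return True   # both files hit EOF without a mismatch
```

The standard library's filecmp.cmp(path_a, path_b, shallow=False) does
much the same job if you'd rather not write the loop yourself.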

But if you're comparing 1,000,000 files, each against all the others,
cryptographic hashes help a lot.

This example code does lots of file comparisons, including via hashes; the
equivs3d version is probably the best example.  equivs3d makes dividing a
huge number of files into groups of equal content almost an O(n) operation,
without assuming the hashes always tell the truth:
http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html
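
To give the flavor of the idea, here is a rough sketch.  It is not the
equivs3d code, just one common way to do this kind of grouping: bucket
by size, then by hash, then confirm byte-for-byte so a hash collision
can't fool you.  group_identical, _digest and the choice of SHA-256 are
my own assumptions for the example.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def _digest(path, chunk_size=64 * 1024):
    """Hash a file in chunks (SHA-256 chosen arbitrarily for this sketch)."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.digest()


def group_identical(paths):
    """Return a list of lists; each inner list holds paths with equal content."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) == 1:
            groups.append(same_size)  # unique size: no hashing needed
            continue
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[_digest(p)].append(p)
        for candidates in by_hash.values():
            # Don't trust the hash alone: confirm against one
            # representative of each class with a full comparison.
            classes = []
            for p in candidates:
                for cls in classes:
                    if filecmp.cmp(cls[0], p, shallow=False):
                        cls.append(p)
                        break
                else:
                    classes.append([p])
            groups.extend(classes)
    return groups
```

Each file gets hashed at most once, and the byte-for-byte pass only
compares files that already agree on size and hash, so the extra
comparisons stay roughly linear unless there are real collisions.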

Also, I believe it's preferred to use open() rather than file().

You could probably avoid writing a list of filenames by doing your
comparisons at the same time you do your copies.  It doesn't provide quite
as strong assurance that way, but it does get rid of the need to save which
files were processed.  You could also do a second os.walk, but of course,
that's subject to issues when one of the trees has been changed by something
other than your copying program.
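
Something along these lines is what I mean by comparing while you copy.
Treat it as a sketch: backup_tree is a made-up name, it returns the
failed paths instead of printing, and it uses filecmp.cmp for the
verification rather than md5.

```python
import filecmp
import os
import shutil


def backup_tree(source, destination):
    """Copy source into destination, verifying each file right after its copy.

    Returns a list of source paths that failed to copy or verify.
    """
    failures = []
    for path, dirs, files in os.walk(source):
        # Mirror the current directory's position under destination.
        rel = os.path.relpath(path, source)
        target_dir = os.path.normpath(os.path.join(destination, rel))
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        # files is a list of names; handle them one at a time.
        for fname in files:
            src = os.path.join(path, fname)
            dst = os.path.join(target_dir, fname)
            try:
                shutil.copy2(src, dst)
                ok = filecmp.cmp(src, dst, shallow=False)
            except (IOError, OSError):
                ok = False
            if not ok:
                failures.append(src)
    return failures
```

No list of filenames is kept except for the failures, which is usually
what you want to report anyway.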

Finally...  Why not just use rsync or robocopy?

On Sat, Mar 19, 2011 at 5:45 PM, ecu_jon <hayesjdno3 at yahoo.com> wrote:

> so i am trying to add md5 checksum calc to my file copy stuff, to make
> sure the source and dest. are same file.
> i implemented it fine with the single file copy part. something like :
> for files in sourcepath:
>        f1=file(files ,'rb')
>        try:
>            shutil.copy2(files,
> os.path.join(destpath,os.path.basename(files)))
>        except:
>            print "error file"
>        f2=file(os.path.join(destpath,os.path.basename(files)), 'rb')
>        truth = md5.new(f1.read()).digest() ==
> md5.new(f2.read()).digest()
>        if truth == 0:
>            print "file copy error"
>
> this worked swimmingly. i moved on to my backupall function, something
> like
> for (path, dirs, files) in os.walk(source):
>        #os.walk drills down thru all the folders of source
>        for fname in dirs:
>           currentdir = destination+leftover
>            try:
>               os.mkdir(os.path.join(currentdir,fname),0755)
>            except:
>                print "error folder"
>        for fname in files:
>            leftover = path.replace(source, '')
>            currentdir = destination+leftover
>            f1=file(files ,'rb')
>            try:
>                shutil.copy2(os.path.join(path,fname),
>                             os.path.join(currentdir,fname))
>                f2 = file(os.path.join(currentdir,fname,files))
>            except:
>                print "error file"
>            truth = md5.new(f1.read()).digest() ==
> md5.new(f2.read()).digest()
>            if truth == 0:
>                print "file copy error"
>
> but here, "fname" is a list, not a single file.i didn't really want to
> spend a lot of time on the md5 part. thought it would be an easy add-
> on. i don't really want to write the file names out to a list and
> parse through them one a time doing the calc, but it sounds like i
> will have to do something like that.