Efficient checksum calculating on large files

Nick Craig-Wood nick at craig-wood.com
Tue Feb 8 12:30:58 EST 2005


Ola Natvig <ola.natvig at infosense.no> wrote:
>  Hi all
> 
>  Does anyone know of a fast way to calculate checksums for a large file?
>  I need a way to generate ETag keys for a webserver. ETags for large
>  files are not really necessary, but it would be nice if I could do it. I'm
>  using the Python hash function on the dynamically generated strings (like
>  page content), but for things like images I use shutil's
>  copyfileobj function, and the hash of a file object is just its
>  handle's memory address.
> 
>  Does anyone know of a Python utility I could use, perhaps
>  something like the md5sum utility on *nix systems?

Here is an implementation of md5sum in Python.  It runs at the same
speed as md5sum itself, give or take.  This isn't surprising, since the
run time is dominated by the CPU usage of the MD5 routine (in C in both
cases) and/or I/O (also in C).

I discarded the first run so both tests ran with large_file in the
cache.

$ time md5sum large_file
e7668fdc06b68fbf087a95ba888e8054  large_file

real    0m1.046s
user    0m0.946s
sys     0m0.071s

$ time python md5sum.py large_file
e7668fdc06b68fbf087a95ba888e8054  large_file

real    0m1.033s
user    0m0.926s
sys     0m0.108s

$ ls -l large_file
-rw-r--r--  1 ncw ncw 115933184 Jul  8  2004 large_file
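
As a rough reading of the figures above, the file size and the two
"real" times can be turned into a throughput number.  This is only
arithmetic on the transcript, not a new measurement; the helper name
throughput_mib_per_s is made up for the illustration.

```python
# Figures copied from the transcript above (file size in bytes, "real" time in seconds)
size_bytes = 115933184
timings = {"md5sum": 1.046, "python md5sum.py": 1.033}


def throughput_mib_per_s(nbytes, seconds):
    """Convert a byte count and a wall-clock time into MiB per second."""
    return nbytes / (1024.0 * 1024.0) / seconds


for name, seconds in timings.items():
    print("%-18s %.1f MiB/s" % (name, throughput_mib_per_s(size_bytes, seconds)))
```

Both runs come out a little over 100 MiB/s, which fits the claim that
the two programs spend their time in the same C code.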


"""
Re-implementation of md5sum in python
"""

import sys
import md5

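def md5file(filename):
    """Return the hex digest of a file without loading it all into memory"""
    # Binary mode so the bytes are hashed unmodified on every platform
    fh = open(filename, "rb")
    digest = md5.new()
    try:
        while 1:
            # Read in small blocks so memory use stays constant
            buf = fh.read(4096)
            if buf == "":
                break
            digest.update(buf)
    finally:
        # Close the file even if a read fails part way through
        fh.close()
    return digest.hexdigest()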

def md5sum(files):
    for filename in files:
        try:
            print "%s  %s" % (md5file(filename), filename)
        except IOError, e:
            print >> sys.stderr, "Error on %s: %s" % (filename, e)

if __name__ == "__main__":
    md5sum(sys.argv[1:])
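
The script above targets the Python of the time (the md5 module and the
print statement).  For readers on Python 3, the same block-by-block
approach might look like the sketch below.  It assumes hashlib.md5 is
available in your build, and the 64 KiB default block size is an
arbitrary choice for the illustration, not something measured in this
post.

```python
import hashlib
import sys


def md5file(filename, blocksize=64 * 1024):
    """Return the MD5 hex digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.md5()
    # Binary mode: hash the raw bytes, with no newline translation or decoding
    with open(filename, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF)
        for buf in iter(lambda: fh.read(blocksize), b""):
            digest.update(buf)
    return digest.hexdigest()


def md5sum(files):
    """Print 'digest  filename' per file, in the same layout as md5sum."""
    for filename in files:
        try:
            print("%s  %s" % (md5file(filename), filename))
        except OSError as e:
            print("Error on %s: %s" % (filename, e), file=sys.stderr)
```

Calling md5sum(sys.argv[1:]) from a script gives the same command-line
behaviour as the original.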

-- 
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick


