Efficient checksum calculation on large files
Nick Craig-Wood
nick at craig-wood.com
Tue Feb 8 12:30:58 EST 2005
Ola Natvig <ola.natvig at infosense.no> wrote:
> Hi all
>
> Does anyone know of a fast way to calculate checksums for a large file?
> I need a way to generate ETag keys for a webserver. The ETags of large
> files are not really necessary, but it would be nice if I could do it. I'm
> using the python hash function on dynamically generated strings (like
> page content), but on things like images I use shutil's
> copyfileobj function, and the hash of a file object is just its
> handler's memory address.
>
> Does anyone know of a python utility I could use, perhaps
> something like the md5sum utility on *nix systems?
Here is an implementation of md5sum in python. It's the same speed,
give or take, as md5sum itself. This isn't surprising, since md5sum is
dominated by CPU usage of the MD5 routine (in C in both cases) and/or
I/O (also in C).
I discarded the first run so both tests ran with large_file in the
cache.
$ time md5sum large_file
e7668fdc06b68fbf087a95ba888e8054 large_file
real 0m1.046s
user 0m0.946s
sys 0m0.071s
$ time python md5sum.py large_file
e7668fdc06b68fbf087a95ba888e8054 large_file
real 0m1.033s
user 0m0.926s
sys 0m0.108s
$ ls -l large_file
-rw-r--r-- 1 ncw ncw 115933184 Jul 8 2004 large_file
"""
Re-implementation of md5sum in python
"""
import sys
import md5
def md5file(filename):
"""Return the hex digest of a file without loading it all into memory"""
fh = open(filename)
digest = md5.new()
while 1:
buf = fh.read(4096)
if buf == "":
break
digest.update(buf)
fh.close()
return digest.hexdigest()
def md5sum(files):
for filename in files:
try:
print "%s %s" % (md5file(filename), filename)
except IOError, e:
print >> sys.stderr, "Error on %s: %s" % (filename, e)
if __name__ == "__main__":
md5sum(sys.argv[1:])
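The same chunked-read approach carries over to the newer hashlib module,
which supersedes the md5 module; a sketch in Python 3 syntax (only the
module and print syntax differ, the behaviour is the same):

```python
"""
Re-implementation of md5sum using hashlib (Python 3 sketch).
"""
import hashlib
import sys

def md5file(filename):
    """Return the hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    # Binary mode so the digest matches md5sum; read in 64 KiB chunks
    # until read() returns the empty bytes object.
    with open(filename, "rb") as fh:
        for buf in iter(lambda: fh.read(65536), b""):
            digest.update(buf)
    return digest.hexdigest()

def md5sum(files):
    for filename in files:
        try:
            print("%s %s" % (md5file(filename), filename))
        except IOError as e:
            print("Error on %s: %s" % (filename, e), file=sys.stderr)

if __name__ == "__main__":
    md5sum(sys.argv[1:])
```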
--
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick