python snippet request: calculate MD5 checksum on 650 MB ISO cdrom image quickly

Tim Peters tim_one at email.msn.com
Tue Oct 24 13:39:05 EDT 2000


[Warren Postma]
> I am writing some python scripts to manage downloading (and
> re-downloading) ISO images from FTP mirrors, and doing MD5 checksums
> on the received files to make sure they are intact.
>
> I noticed that there is an MD5 message digest module in Python.
> But it only accepts STRINGS.  Is there some way to pass a WHOLE FILE to
> it, less awkwardly than having a WHILE loop that reads 1k chunks and
> passes it along to the MD5 module.

You can read the entire file into a string at one gulp, via e.g. f.read().
One-liner.
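Spelled out, the one-gulp version is just this (a sketch; `md5file_gulp` is a
made-up name, and on Pythons where the `md5` module is gone the same object
lives in `hashlib`):

```python
try:
    import md5                     # the stdlib module discussed above
    _md5_new = md5.new
except ImportError:                # later Pythons folded md5 into hashlib
    import hashlib
    _md5_new = hashlib.md5

def md5file_gulp(filename):
    # Read the whole file into one string and hash it in one shot.
    # Fine if you have the RAM; a 650 MB ISO needs 650 MB of memory.
    f = open(filename, "rb")
    data = f.read()
    f.close()
    return _md5_new(data).hexdigest()
```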

> A new function in the md5 module, md5.md5file(filename) would be nice,
> if anyone is listening, for a future python 2.x release.
>
> I'll contribute a patch if anyone thinks it's a good idea.

Alas, I don't:  there's no magic to be had here.  Such a function will have
to make up its own policy for chunking the file input, and one size doesn't
fit all.  The "while" loop is trivial to write, and really has no bad effect
on speed even if written in Python (1Kb chunks are very small, btw -- why
not use 64Kb, or 1Mb, chunks?  the "sweet spot" on your system is something
you can determine (see below), but a builtin md5file method can't guess for
you).
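To make the point concrete, here's that trivial loop, checked against several
chunk sizes (a sketch using the `hashlib` spelling found in later Pythons;
`md5_chunked` is a made-up name):

```python
import hashlib, io

def md5_chunked(f, CHUNK=2**16):
    # The trivial loop: same shape as the getmd5() harness function below.
    m = hashlib.md5()
    while 1:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        m.update(chunk)
    return m.hexdigest()

# The chunk size changes the speed, never the digest:
data = b"abc" * 100000
digests = set(md5_chunked(io.BytesIO(data), 2**k) for k in (10, 16, 20))
assert len(digests) == 1
```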

> If there's already a one-to-ten line method that doesn't involve
> iterating (SLOWLY) over chunks of the file that got loaded into Strings
> that would be great.
>
> My current method is to use os.popen("md5 " + filename), which
> then spawns a command line md5 utility, which seems to me to be kind
> of wasteful and slow.

Why the emphasis on "SLOWLY" and "slow"?  You may be missing that md5 is
*designed* to be slow <0.9 wink>!  It's supposed to give *such* a good hash
that it's computationally intractable to fool it on purpose, and it does a
lot of work to achieve that.  If you want a *faster* checksum, then e.g. use
crc32 instead.  CRCs are easy to fool on purpose, but cheaper to compute.
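As with md5, crc32 can be fed in chunks: binascii.crc32 takes the running
value as a second argument, so the chunked and one-gulp results always agree
(a small sketch; `crc32_chunks` is a made-up name):

```python
import binascii

def crc32_chunks(chunks):
    # Feed crc32 incrementally: pass the running CRC back in as the
    # second argument on each call, just as the getcrc32() harness
    # function below does.
    result = 0
    for chunk in chunks:
        result = binascii.crc32(chunk, result)
    return result

# Chunking doesn't change the answer:
whole = b"abcdefghij" * 1000
pieces = [whole[i:i+64] for i in range(0, len(whole), 64)]
assert crc32_chunks(pieces) == binascii.crc32(whole, 0)
```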
I'll attach a little timing harness to compare crc32, md5 and sha speed;
plug in the name of a large file (TEST) and play w/ different chunking
factors to see what happens; you'll usually find that crc32 is faster than
md5 is faster than sha.

import md5, binascii, sha, time, os

def getcrc32(f, CHUNK=2**16):
    # Pass the running CRC back to binascii.crc32 on each chunk.
    result = 0
    while 1:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        result = binascii.crc32(chunk, result)
    return result

def getmd5(f, CHUNK=2**16):
    m = md5.new()
    while 1:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        m.update(chunk)
    return m.digest()

def getsha(f, CHUNK=2**16):
    m = sha.new()
    while 1:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        m.update(chunk)
    return m.digest()

TEST = "/Python20/huge.html"

print "Timing", TEST, "w/ size", os.path.getsize(TEST)

# Warm up the system file cache, to avoid penalizing the first func.
f = open(TEST, "rb")
f.read()
f.close()

for func in getcrc32, getmd5, getsha:
    f = open(TEST, "rb")
    start = time.clock()
    func(f, 2**25)
    finish = time.clock()
    f.close()
    print "Time for", func.__name__, "is", round(finish - start, 3)
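For what it's worth, later Pythons dropped the standalone md5 and sha modules
in favor of hashlib (the sha module is SHA-1, i.e. hashlib.sha1), so the same
harness can be re-spelled generically (a sketch: `getdigest` is a made-up
name, time.perf_counter is the later portable timer, and a 4 MB in-memory
buffer stands in for a big TEST file):

```python
import hashlib, io, time

def getdigest(f, new, CHUNK=2**16):
    # Same loop as getmd5()/getsha() above, parameterized on the
    # digest constructor instead of hard-wiring one per function.
    m = new()
    while 1:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        m.update(chunk)
    return m.digest()

data = b"\0" * (2**22)  # 4 MB stand-in for a big TEST file
for name, new in (("md5", hashlib.md5), ("sha", hashlib.sha1)):
    f = io.BytesIO(data)
    start = time.perf_counter()
    getdigest(f, new)
    finish = time.perf_counter()
    print("Time for", name, "is", round(finish - start, 3))
```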





