python snippet request: calculate MD5 checksum on 650 MB ISO cdrom image quickly

Alexander Gavrilov gavrilov at iname.com
Tue Oct 24 17:31:28 EDT 2000


I ran the test script Tim posted and got the following results on my system:

Timing C:\Program Files\Microsoft Visual Studio\MSDN98\98VS\1033\MSDNVS98.CHQ w/ size 95823736
Time for getcrc32 is 29.852
Time for getmd5 is 29.444
Time for getsha is 31.513

The system is Windows NT4 SP6, dual Pentium II 300 MHz.

"Tim Peters" <tim_one at email.msn.com> wrote in message
news:mailman.972409263.1128.python-list at python.org...
> [Warren Postma]
> > I am writing some python scripts to manage downloading (and
> > re-downloading) ISO images from FTP mirrors, and doing MD5 checksums
> > on the received files to make sure they are intact.
> >
> > I noticed that there is an MD5 message digest module in Python.
> > But it only accepts STRINGS.  Is there some way to pass a WHOLE FILE to
> > it, less awkwardly than having a WHILE loop that reads 1k chunks and
> > passes it along to the MD5 module.
>
> You can read the entire file into a string at one gulp, via e.g. f.read().
> One-liner.
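
[A minimal sketch of that one-gulp approach, using the hashlib module that later superseded the old md5 module; md5_one_gulp is a hypothetical name:

```python
import hashlib

def md5_one_gulp(filename):
    # Read the whole file into memory with a single f.read(), hash it once.
    # Simple, but a 650 MB ISO must then fit comfortably in RAM.
    with open(filename, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
```

As Tim notes below, this trades memory for simplicity; the chunked loop avoids holding the whole file at once.]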
>
> > A new function in the md5 module, md5.md5file(filename) would be nice,
> > if anyone is listening, for a future python 2.x release.
> >
> > I'll contribute a patch if anyone thinks it's a good idea.
>
> Alas, I don't:  there's no magic to be had here.  Such a function will
> have to make up its own policy for chunking the file input, and one size
> doesn't fit all.  The "while" loop is trivial to write, and really has no
> bad effect on speed even if written in Python (1Kb chunks are very small,
> btw -- why not use 64Kb, or 1Mb, chunks?  the "sweet spot" on your system
> is something you can determine (see below), but a builtin md5file method
> can't guess for you).
>
> > If there's already a one-to-ten line method that doesn't involve
> > iterating (SLOWLY) over chunks of the file that got loaded into Strings
> > that would be great.
> >
> > My current method is to use os.popen("md5 "+filename"), which
> > then spawns a command line md5 utility, which seems to me to be kind
> > of wasteful and slow.
>
> Why the emphasis on "SLOWLY" and "slow"?  You may be missing that md5 is
> *designed* to be slow <0.9 wink>!  It's supposed to give *such* a good
> hash that it's computationally intractable to fool it on purpose, and it
> does a lot of work to achieve that.  If you want a *faster* checksum,
> then e.g. use crc32 instead.  CRCs are easy to fool on purpose, but
> cheaper to compute.  I'll attach a little timing harness to compare
> crc32, md5 and sha speed; plug in the name of a large file (TEST) and
> play w/ different chunking factors to see what happens; you'll usually
> find that crc32 is faster than md5 is faster than sha.
>
> import md5, binascii, sha, time, os
>
> def getcrc32(f, CHUNK=2**16):
>     result = 0
>     while 1:
>         chunk = f.read(CHUNK)
>         if not chunk:
>             break
>         result = binascii.crc32(chunk, result)
>     return result
>
> def getmd5(f, CHUNK=2**16):
>     m = md5.new()
>     while 1:
>         chunk = f.read(CHUNK)
>         if not chunk:
>             break
>         m.update(chunk)
>     return m.digest()
>
> def getsha(f, CHUNK=2**16):
>     m = sha.new()
>     while 1:
>         chunk = f.read(CHUNK)
>         if not chunk:
>             break
>         m.update(chunk)
>     return m.digest()
>
> TEST = "/Python20/huge.html"
>
> print "Timing", TEST, "w/ size", os.path.getsize(TEST)
>
> # Warm up the system file cache, to avoid penalizing the first func.
> f = open(TEST, "rb")
> f.read()
> f.close()
>
> for func in getcrc32, getmd5, getsha:
>     f = open(TEST, "rb")
>     start = time.clock()
>     func(f, 2**25)
>     finish = time.clock()
>     f.close()
>     print "Time for", func.__name__, "is", round(finish - start, 3)
>
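
[For readers on newer Pythons: the md5 and sha modules were later folded into hashlib, and Tim's chunked loop carries over unchanged. A hedged sketch of the same pattern, parametrized by algorithm; hash_file is a hypothetical name:

```python
import hashlib

def hash_file(filename, algorithm="md5", chunk=2**16):
    # Same while-loop pattern as the functions above, but the algorithm
    # name ("md5", "sha1", "sha256", ...) is passed to hashlib.new().
    h = hashlib.new(algorithm)
    with open(filename, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()
```

The chunk-size argument plays the same role as CHUNK in the harness above, so the "sweet spot" experiment still applies.]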

More information about the Python-list mailing list