md5 and large files

Roger Binns rogerb at rogerbinns.com
Sun Oct 17 21:36:15 EDT 2004


Brad Tilley wrote:
> I would like to verify that the files are not corrupt so what's the
> most efficient way to calculate md5 sums on 4GB files? The machine
> doing the calculations is a small desktop with 256MB of RAM.

If you need to be 100% certain, then only an md5sum over the entire
file will work, as Tim points out.
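
Hashing the whole file doesn't need much memory as long as you read it
in fixed-size chunks rather than all at once.  A rough, untested sketch
using the hashlib module (the md5 module works the same way via
md5.new(); the buffer size is just a placeholder):

import hashlib

def md5_file(path, bufsize=64 * 1024):
    # Read the file in fixed-size chunks so only one buffer is ever
    # held in memory, regardless of how large the file is.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()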

If you don't need to be 100% certain, then you can pick random blocks
out of the file and check their sums only.  The more blocks you use,
the more certain you will be.
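
A sketch of that idea (untested; the block size, block count, and seed
are arbitrary choices you would tune, and both sides must use the same
values so they sample the same offsets):

import hashlib, os, random

def md5_sample(path, blocksize=64 * 1024, nblocks=32, seed=0):
    # Hash nblocks randomly chosen blocks instead of the whole file.
    # Sorting the offsets keeps the reads roughly sequential on disk.
    size = os.path.getsize(path)
    rng = random.Random(seed)
    offsets = sorted(rng.randrange(0, max(size - blocksize, 1))
                     for _ in range(nblocks))
    h = hashlib.md5()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            h.update(f.read(blocksize))
    return h.hexdigest()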

You may also want to consider external tools such as rsync, which can
transfer large files and, when the source changes, send only the minimum
needed to keep the destination up to date.  It will also cope correctly
(and efficiently) if network connections break during transfers.

Another thing you can do is use the random method above to check a few
blocks (always check the beginning and end, since they are the most
likely to be corrupted), and then schedule a more thorough background
check on the files, which you may be able to offload to another machine
or another time.
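
The beginning-and-end part of that could look something like this
(untested sketch; the 1MB block size is arbitrary):

import hashlib, os

def md5_head_tail(path, blocksize=1024 * 1024):
    # Hash just the first and last blocks, which are the parts most
    # likely to be truncated or corrupted.
    size = os.path.getsize(path)
    h = hashlib.md5()
    with open(path, "rb") as f:
        h.update(f.read(blocksize))
        if size > blocksize:
            f.seek(max(size - blocksize, 0))
            h.update(f.read(blocksize))
    return h.hexdigest()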

The one thing you don't want to do is check multiple files at the same
time.  That will cause the disk heads to keep jumping around and drastically
slow overall disk throughput.

You also didn't say if you are actually having a performance problem.  On
my machine, I did some timing tests against a 2GB file.  Using the md5sum
program that comes with the operating system, it took 47.5 seconds.  I
then tried the Python md5sum program.  It took 57 seconds with 8KB or 64KB
buffer sizes but 65.8 seconds with a 1MB buffer size.  (The Python times
include the time to start and stop the interpreter.)

So in my case the machine can do 1GB every 30 seconds (ATA100 controller
and disk under Linux).
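
If you want to repeat that kind of timing on your own files, something
along these lines will do (untested sketch; the path and buffer sizes
are just placeholders):

import hashlib, time

def time_md5(path, bufsize):
    # Hash the whole file with the given read buffer size and report
    # how long it took.
    start = time.time()
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.hexdigest(), time.time() - start

for bufsize in (8 * 1024, 64 * 1024, 1024 * 1024):
    digest, elapsed = time_md5("bigfile.bin", bufsize)
    print(bufsize, elapsed, digest)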

Roger 




