md5 and large files

Andrew Dalke adalke at mindspring.com
Sun Oct 17 21:32:48 EDT 2004


Nelson Minar wrote:
> If all you want to do is verify that a file is not corrupt, MD5 is the
> wrong algorithm to use. Use something fast like crc32.

How much faster is that in Python?  It looks about the
same to me.

 >>> import binascii, md5, os, time
 >>> def crc32file(infile):
...   crc = 0
...   while 1:
...     s = infile.read(16384)
...     if not s:
...       return crc
...     crc = binascii.crc32(s, crc)
...
 >>> def md5file(infile):
...   md5obj = md5.new()
...   while 1:
...     s = infile.read(16384)
...     if not s:
...       return md5obj.hexdigest()
...     md5obj.update(s)
...
 >>> os.path.getsize("/Users/dalke/databases/sprot/sprot40.dat")
320673785L
 >>> if 1:
...   t1 = time.time()
...   print md5file(open("/Users/dalke/databases/sprot/sprot40.dat", "rb"))
...   t2 = time.time()
...   print t2-t1
...
a2f54de61e4db857aadce04298ab177e
10.9378840923
 >>> if 1:
...   t1 = time.time()
...   print crc32file(open("/Users/dalke/databases/sprot/sprot40.dat", "rb"))
...   t2 = time.time()
...   print t2-t1
...
-1921799528
10.7424199581
 >>>

I think most of the time is spent doing I/O, not computing
the checksum.  That's probably true even if it were written in C.
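
One way to check the I/O guess would be to time a pass that only reads
the file and throws the data away, then compare it with the two
checksum passes. The following is a rough sketch, not something from
the session above. It assumes a newer Python where hashlib.md5 stands
in for the md5 module, and the names timed_pass, Crc32, and compare
are made up for illustration.

```python
import binascii
import hashlib
import time

CHUNK_SIZE = 16384  # same read size as the interactive session


def timed_pass(infile, update=None):
    """Read infile to EOF in chunks, calling update(chunk) if given.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    while True:
        s = infile.read(CHUNK_SIZE)
        if not s:
            return time.perf_counter() - start
        if update is not None:
            update(s)


class Crc32:
    """Hypothetical helper: lets crc32 be driven like a hash object."""

    def __init__(self):
        self.value = 0

    def update(self, s):
        self.value = binascii.crc32(s, self.value)


def compare(path):
    """Return (read_only, crc32, md5) timings in seconds for one file."""
    results = []
    for update in (None, Crc32().update, hashlib.md5().update):
        with open(path, "rb") as infile:
            results.append(timed_pass(infile, update))
    return tuple(results)
```

If the read-only time is close to the other two, the checksum work is
lost in the noise of the I/O. The operating system's file cache can
make later passes faster than the first, so it is worth calling
compare() more than once before drawing a conclusion.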

				Andrew
				dalke at dalkescientific.com


