Efficient checksum calculation on large files

Nick Craig-Wood nick at craig-wood.com
Wed Feb 9 05:31:22 EST 2005


Fredrik Lundh <fredrik at pythonware.com> wrote:
>  on my machine, Python's md5+mmap is a little bit faster than
>  subprocess+md5sum:
> 
>      import os, md5, mmap
> 
>      file = open(fn, "r+")
>      size = os.path.getsize(fn)
>      hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()
> 
>  (I suspect that md5sum also uses mmap, so the difference is
>  probably just the subprocess overhead)

But you won't be able to md5sum a file bigger than about 4 GB that way
on a 32-bit processor (like x86), will you?  (I don't know how the
kernel / user-space VM split works on Windows, but on Linux 3 GB is
the maximum size you can mmap.)

$ dd if=/dev/zero of=z count=1 bs=1048576 seek=8192
$ ls -l z
-rw-r--r--  1 ncw ncw 8590983168 Feb  9 09:26 z

>>> fn="z"
>>> import os, md5, mmap
>>> file = open(fn, "rb")
>>> size = os.path.getsize(fn)
>>> size
8590983168L
>>> hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: memory mapped size is too large (limited by C int)
>>> 
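One way around the mmap limit is to feed the hash in fixed-size chunks, so memory use stays constant no matter how big the file is. A minimal sketch (written for modern Python, where the hashlib module has replaced md5; the function name and chunk size are just illustrative choices):

```python
import hashlib

def md5_file(path, chunk_size=1024 * 1024):
    """MD5 a file of any size by reading it in fixed-size chunks,
    avoiding both mmap size limits and large memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```

This should produce the same digest as hashing the whole file in one go, since MD5 is defined over a stream of bytes and update() can be called repeatedly.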

-- 
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick
