Efficient checksum calculation on large files
Nick Craig-Wood
nick at craig-wood.com
Wed Feb 9 05:31:22 EST 2005
Fredrik Lundh <fredrik at pythonware.com> wrote:
> on my machine, Python's md5+mmap is a little bit faster than
> subprocess+md5sum:
>
> import os, md5, mmap
>
> file = open(fn, "r+")
> size = os.path.getsize(fn)
> hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()
>
> (I suspect that md5sum also uses mmap, so the difference is
> probably just the subprocess overhead)
But you won't be able to md5sum a file bigger than about 4 GB on a
32-bit processor (like x86), will you? (I don't know how the kernel /
user space VM split works on Windows, but on Linux 3 GB is the largest
region you can mmap.)
$ dd if=/dev/zero of=z count=1 bs=1048576 seek=8192
$ ls -l z
-rw-r--r-- 1 ncw ncw 8590983168 Feb 9 09:26 z
>>> fn="z"
>>> import os, md5, mmap
>>> file = open(fn, "rb")
>>> size = os.path.getsize(fn)
>>> size
8590983168L
>>> hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
OverflowError: memory mapped size is too large (limited by C int)
>>>
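Reading the file in fixed-size blocks sidesteps the mmap size limit
entirely, since memory use stays constant no matter how big the file
is. A minimal sketch (written against the modern hashlib module; the
old md5 module's md5.md5() object has the same update()/hexdigest()
interface, so the loop is identical either way):

```python
import hashlib

def md5_file(path, blocksize=1 << 20):
    # Hash the file in 1 MB chunks so only one block is ever
    # held in memory -- works the same for an 8 GB file as for
    # an 8 KB one, on 32-bit and 64-bit machines alike.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()
```

This should give the same digest as md5sum or the mmap version, just
without mapping the whole file into the address space at once.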
--
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick