Why checksum? [was Re: Fuzzy Lookups]

Paul Rubin
Wed Feb 1 19:52:34 EST 2006


Steven D'Aprano <steve at REMOVETHIScyber.com.au> writes:
> Sure. But if you are just comparing two files, is there any reason to
> bother with a checksum? (MD5 or other.)

No, of course not, except in special situations, such as some problem
opening and reading both files simultaneously.  E.g.: the files are on
two different DVD-Rs, they are too big to fit in RAM, and you only
have one DVD drive.  If you want to compare byte by byte, you have to
either copy one of the DVDs to your hard disk (if you have the space
available) or else swap DVDs back and forth in the drive, reading
and comparing a bufferload at a time.  But you can easily read in the
first DVD and compute its hash on the fly, then read and hash the
second DVD and compare the hashes.
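Hashing on the fly might look roughly like the sketch below.  The
function name file_digest, the 1 MB buffer size, and the choice of
SHA-256 in place of MD5 are all assumptions made for illustration;
any algorithm name that hashlib.new() accepts could be passed instead.

```python
import hashlib

def file_digest(path, algorithm="sha256", bufsize=1 << 20):
    """Return the hex digest of a file, read one bufferload at a time.

    Only `bufsize` bytes are held in memory at once, so the file can be
    far larger than RAM.  (Hypothetical helper, not from the original post.)
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:          # empty bytes object means end of file
                break
            h.update(chunk)
    return h.hexdigest()
```

With one drive you would call file_digest() on the file from the first
disc, keep the returned string, swap discs, call it again on the second
file, and compare the two strings.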

If it's a normal situation with two files on the hard disk, just open
both files simultaneously, and use large buffers to keep the amount of
seeking reasonable.  That will be faster than big MD5 computations, and
more reliable (there are known ways to construct pairs of distinct
files that have the same MD5 hash).
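A minimal sketch of the direct comparison, assuming two regular files
on local disk.  The name same_contents and the 1 MB buffer size are
made up for the example; the size check up front is an extra shortcut,
not something the comparison strictly needs.

```python
import os

def same_contents(path_a, path_b, bufsize=1 << 20):
    """Compare two files byte for byte, one large buffer at a time.

    (Hypothetical helper for illustration; assumes regular files, where
    read() returns a short buffer only at end of file.)
    """
    # Files of different length cannot be equal, so skip the reads.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(bufsize)
            b = fb.read(bufsize)
            if a != b:
                return False       # stop at the first differing bufferload
            if not a:              # both reads empty: reached EOF together
                return True
```

The standard library's filecmp.cmp(path_a, path_b, shallow=False) does
much the same job, if you would rather not write the loop yourself.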


