binary file compare...

Steven D'Aprano steven at REMOVE.THIS.cybersource.com.au
Wed Apr 15 05:03:16 EDT 2009


On Wed, 15 Apr 2009 07:54:20 +0200, Martin wrote:

>> Perhaps I'm being dim, but how else are you going to decide if two
>> files are the same unless you compare the bytes in the files?
> 
> I'd say checksums, just about every download relies on checksums to
> verify you do have indeed the same file.

The checksum does look at every byte in each file. Checksumming isn't a 
way to avoid looking at each byte of the two files; it is a way of 
mapping all the bytes to a single number.
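
As a rough sketch of that point, here is what computing a checksum with 
the standard hashlib module looks like. The helper name file_md5 and the 
chunk size are arbitrary choices for illustration, not anything from 
this thread. Note that the loop still has to read the entire file:

```python
import hashlib

def file_md5(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Every chunk, and so every byte, is read and fed to the hash;
        # the digest cannot be produced without touching the whole file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

To compare two files this way you would call it once per file and 
compare the two returned strings, so both files are read in full even 
when they differ in the very first byte.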



>> You could hash them and compare the hashes, but that's a lot more work
>> than just comparing the two byte streams.
> 
> hashing is not exactly much work; in its simplest form it's 2 lines per
> file.

Hashing is a *lot* more work than just comparing two bytes. The MD5 
checksum has been specifically designed to be fast and compact, yet the 
algorithm is still complicated:

http://en.wikipedia.org/wiki/MD5#Pseudocode

The reference implementation is here:

http://www.fastsum.com/rfc1321.php#APPENDIXA

SHA-1 is more complicated still:

http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1_pseudocode


Just because *calling* some checksum function is easy doesn't make the 
checksum function itself simple. They do a LOT more work than just a 
simple comparison between bytes, and that's totally unnecessary work if 
you are making a one-off comparison of two local files.
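
For contrast, a minimal sketch of the direct comparison, assuming two 
ordinary local files. The name same_contents and the chunk size are made 
up for this example:

```python
import os

def same_contents(path_a, path_b, chunk_size=64 * 1024):
    """Return True if the two files hold identical bytes (hypothetical helper)."""
    # Files of different sizes cannot be equal, so nothing needs reading.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # stop at the first chunk that differs
            if not chunk_a:
                return True   # both files exhausted, no difference found
```

The only per-byte work here is an equality test, and the function can 
stop as soon as it finds a mismatch. The standard library's 
filecmp.cmp(a, b, shallow=False) performs a similar content comparison 
if you would rather not write the loop yourself.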



-- 
Steven


