Why checksum? [was Re: Fuzzy Lookups]

Tue Jan 31 16:38:50 EST 2006

Steven D'Aprano <steve at REMOVETHIScyber.com.au> writes:
> This isn't a criticism, it is a genuine question. Why do people compare
> local files with MD5 instead of doing a byte-to-byte compare? Is it purely
> a caching thing (once you have the checksum, you don't need to read the
> file again)? Are there any other reasons?

It's not just a matter of comparing two files.  The idea is you have
10,000 local files and you want to find which ones are duplicates
(i.e. if files 637 and 2945 have the same contents, you want to
discover that).  The obvious way is to make a list of hashes and sort
the list; duplicates then show up as adjacent entries with equal hashes.
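
A minimal sketch of that approach in Python, assuming the files are given
as a list of paths.  The names file_digest and find_duplicates are made up
for illustration, and MD5 is used only because it is the hash mentioned in
the question:

```python
import hashlib
from itertools import groupby
from operator import itemgetter


def file_digest(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Return lists of paths whose contents hash to the same digest."""
    # Each file is read once; sorting puts equal digests next to each other.
    hashed = sorted((file_digest(p), p) for p in paths)
    groups = []
    for _, group in groupby(hashed, key=itemgetter(0)):
        names = [p for _, p in group]
        if len(names) > 1:
            groups.append(names)
    return groups
```

Each file is read only once this way, whereas comparing 10,000 files
pairwise would mean roughly 50 million comparisons.  Equal digests only
make identical contents very likely, so a byte-to-byte compare within each
group can confirm the match if that matters.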