Why checksum? [was Re: Fuzzy Lookups]

Paul Rubin
Wed Feb 1 19:45:36 EST 2006


Tom Anderson <twic at urchin.earth.li> writes:
> > The obvious way is make a list of hashes, and sort the list.
> 
> Obvious, perhaps; prudent, no. To make the list of hashes, you have to
> read all of every single file first, which could take a while. If your
> files are reasonably random at the beginning, ...

The possibility of two different mp3 files having the same id3 tags is
something you might specifically be checking for.  Hashing the first
few hundred bytes would be more reliable, since it would include some
actual audio in the hash.  But then you're back to a list of hashes.
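
As a quick sketch of that idea (the 512-byte cutoff and the
head_hash name are arbitrary choices of mine, not anything from the
thread):

    import hashlib

    def head_hash(path, nbytes=512):
        # Hash only the first nbytes of the file -- enough to get
        # past a short ID3 tag and into actual audio data.  (A
        # large ID3v2 tag, e.g. one carrying embedded cover art,
        # could still defeat this.)
        with open(path, 'rb') as f:
            return hashlib.sha1(f.read(nbytes)).hexdigest()

Files whose head hashes differ are certainly different; files whose
head hashes match still need a full comparison to be sure.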

> Better yet, note that if two files are identical, they must have the
> same length, and that finding the length can be done very cheaply, so
> a quicker yet approach is to make a list of lengths, sort that, and
> look for duplicates; when you find one, do a byte-by-byte comparison
> of the files (probably terminating in the first block) to see if they
> really are the same.

Yes, checking file lengths first is an obvious heuristic, but if you
find you have a bunch of files with the same length, what do you do?
You're back to a list of hashes.
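
Something along these lines, say (find_duplicates is a hypothetical
helper; the thread only sketches the approach):

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(paths):
        # Cheap pass: group files by length.
        by_length = defaultdict(list)
        for path in paths:
            by_length[os.path.getsize(path)].append(path)

        # Expensive pass: hash only files whose lengths collide.
        by_hash = defaultdict(list)
        for group in by_length.values():
            if len(group) < 2:
                continue  # unique length, can't have a duplicate
            for path in group:
                with open(path, 'rb') as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                by_hash[digest].append(path)

        return [g for g in by_hash.values() if len(g) > 1]

Only the files whose lengths collide ever get read in full, so the
common case stays cheap.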

> By way of example, of the 2690 music files in my iTunes library, i
> have twelve pairs of same-sized files [1], and all of these differ
> within the first 18 bytes (mostly, within the first 9 bytes).

That's a small enough set of matches that you don't need a
general-purpose algorithm.


