Why checksum? [was Re: Fuzzy Lookups]

Tom Anderson twic at urchin.earth.li
Thu Feb 2 07:10:01 EST 2006


On Thu, 1 Feb 2006, it was written:

> Tom Anderson <twic at urchin.earth.li> writes:
>
>>> The obvious way is make a list of hashes, and sort the list.
>>
>> Obvious, perhaps; prudent, no. To make the list of hashes, you have to 
>> read all of every single file first, which could take a while. If your 
>> files are reasonably random at the beginning, ...
>
> The possibility of two different mp3 files having the same id3 tags is 
> something you might specifically be checking for.

So read from the end of the file, rather than the beginning.
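
Something along these lines would do it - a rough, untested sketch (the 
function name is mine) that uses the last few kilobytes of a file as a 
cheap comparison key:

import os

def tail_key(path, probe=4096):
    # Return the last `probe` bytes of the file (or the whole file if
    # it's shorter than that).
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        length = f.tell()
        f.seek(max(0, length - probe))
        return f.read(probe)

A few kilobytes from the tail is usually enough to split ties cheaply, and 
anything that still matches can fall through to a full comparison.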

>> Better yet, note that if two files are identical, they must have the 
>> same length, and that finding the length can be done very cheaply, so a 
>> quicker yet approach is to make a list of lengths, sort that, and look 
>> for duplicates; when you find one, do a byte-by-byte comparison of the 
>> files (probably terminating in the first block) to see if they really 
>> are the same.
>
> Yes, checking the file lengths first is an obvious heuristic, but if you 
> find you have a bunch of files with the same length, what do you do? 
> You're back to a list of hashes.

Or prefixes or suffixes.
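
Putting the whole thing together, something like this untested sketch (the 
names are mine) - group by size first, then by short head/tail probes, and 
only do a byte-by-byte check (via filecmp) on whatever still agrees:

import os
import filecmp
from collections import defaultdict

def find_duplicates(paths, probe=4096):
    # Identical files must have identical lengths, and stat() is cheap,
    # so group by size before touching any file contents.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        # Sub-group by a short prefix and suffix; this weeds out most
        # same-sized files after reading only a few kilobytes of each.
        by_probe = defaultdict(list)
        for p in group:
            with open(p, 'rb') as f:
                head = f.read(probe)
                f.seek(max(0, size - probe))
                tail = f.read(probe)
            by_probe[(head, tail)].append(p)
        # Anything still lumped together gets a full byte-by-byte check.
        for candidates in by_probe.values():
            for i in range(len(candidates)):
                for j in range(i + 1, len(candidates)):
                    if filecmp.cmp(candidates[i], candidates[j],
                                   shallow=False):
                        duplicates.append((candidates[i], candidates[j]))
    return duplicates

No hashing anywhere, and in the common case each file is only read twice, 
a few kilobytes at a time.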

>> By way of example, of the 2690 music files in my iTunes library, i have 
>> twelve pairs of same-sized files [1], and all of these differ within 
>> the first 18 bytes (mostly, within the first 9 bytes).
>
> That's a small enough set of matches that you don't need a general 
> purpose algorithm.

True - and this is *exactly* the situation that the OP was talking about, 
so this algorithm is appropriate. Moreover, i believe it is representative 
of 
most situations where you have a bunch of files to compare. Of course, 
cases where files are tougher to tell apart do exist, but i think they're 
corner cases. Could you suggest a common kind of file with degenerate 
lengths, prefixes and suffixes?

The only one that springs to mind is a set of same-sized image files in 
some noncompressed format, recording similar images (frames in a movie, 
say), where the differences might be buried deep in the pixel data. As it 
happens, i have just such a dataset on disk: with the images in TIFF 
format, i get differences between subsequent frames after 9 bytes, but i 
suspect that's a timestamp or something; if i convert everything to a nice 
simple BMP (throwing away 8 bits per sample of precision in the process - 
probably turning most of the pixels to 0!), then i find differences about 
a megabyte in. If i compare from the tail in, i also have to wade through 
about a megabyte before finding a difference. Here, hashes would be ideal.
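
For data like that, the usual chunked-hash approach would be the way to go 
- something like this sketch (untested, the name is mine), which keeps 
memory flat even for multi-gigabyte files:

import hashlib

def file_digest(path, chunk=1 << 20):
    # Hash the whole file a megabyte at a time.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

Sort (or dict) on the digests and only byte-compare the files whose 
digests collide, if you're paranoid about the hash.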

tom

-- 
The revolution is here. Get against the wall, sunshine. -- Mike Froggatt


