Why checksum? [was Re: Fuzzy Lookups]

Tom Anderson twic at urchin.earth.li
Wed Feb 1 05:37:57 EST 2006


On Tue, 31 Jan 2006, it was written:

> Steven D'Aprano <steve at REMOVETHIScyber.com.au> writes:
>
>> This isn't a criticism, it is a genuine question. Why do people compare 
>> local files with MD5 instead of doing a byte-to-byte compare?

I often wonder that!

>> Is it purely a caching thing (once you have the checksum, you don't 
>> need to read the file again)? Are there any other reasons?
>
> It's not just a matter of comparing two files.  The idea is you have 
> 10,000 local files and you want to find which ones are duplicates (i.e. 
> if files 637 and 2945 have the same contents, you want to discover 
> that).  The obvious way is to make a list of hashes, and sort the list.
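
For concreteness, a minimal sketch of that hash-based approach (using a 
dict rather than an explicit sort; the function names are mine, not 
anything from the original post):

    import hashlib

    def file_md5(path, chunk_size=1 << 16):
        # Read in chunks so we never hold a whole file in memory -
        # but note we still read every byte of every file.
        h = hashlib.md5()
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()

    def duplicates_by_hash(paths):
        # Map each digest to the paths that produced it; any group
        # with two or more members is a set of identical files.
        groups = {}
        for path in paths:
            groups.setdefault(file_md5(path), []).append(path)
        return [g for g in groups.values() if len(g) > 1]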

Obvious, perhaps; prudent, no. To make the list of hashes, you have to 
read all of every single file first, which could take a while. If your 
files are reasonably random at the beginning, you'd be better off just 
using the first N bytes of each file: just as effective, and far cheaper 
to read. Looking at some random MP3s I have to hand, they all differ 
within the first 20 bytes - probably thanks to the ID3 tags - so this 
should work for those.
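
A rough sketch of that prefix trick (the 20-byte cutoff is just the 
figure from the MP3 observation above, not a guarantee):

    def duplicates_by_prefix(paths, n=20):
        # Group files by their first n bytes; only files sharing a
        # prefix can possibly be duplicates. One small read per file.
        groups = {}
        for path in paths:
            with open(path, 'rb') as f:
                prefix = f.read(n)
            groups.setdefault(prefix, []).append(path)
        return [g for g in groups.values() if len(g) > 1]

Of course, groups that survive this filter still need a full comparison 
to confirm they really are duplicates.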

Better yet, note that if two files are identical, they must have the same 
length, and finding the length is very cheap (a stat call, no reads at 
all). So an even quicker approach is to make a list of lengths, sort it, 
and look for duplicates; when you find one, do a byte-by-byte comparison 
of the files (which will probably terminate within the first block) to 
see whether they really are the same.
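
Something like this, say (filecmp.cmp with shallow=False does the 
byte-by-byte comparison, stopping at the first differing block; again, 
the function name is mine):

    import filecmp
    import os

    def duplicates_by_length(paths):
        # Group by size first - one stat per file, no reads at all.
        by_size = {}
        for path in paths:
            by_size.setdefault(os.path.getsize(path), []).append(path)

        confirmed = []
        for group in by_size.values():
            # Only same-sized files ever have their contents read.
            for i, a in enumerate(group):
                for b in group[i + 1:]:
                    if filecmp.cmp(a, b, shallow=False):
                        confirmed.append((a, b))
        return confirmed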

By way of example, of the 2690 music files in my iTunes library, I have 
twelve pairs of same-sized files [1], and all of these differ within the 
first 18 bytes (mostly within the first 9 bytes). Therefore, I could rule 
out duplication with just 22 data blocks read from disk (plus rather more 
blocks of directory information and inodes, of course). A hash-based 
approach would have had to wade through a touch over 13 GB of data before 
it could even get started.

Of course, there are situations where this is the wrong approach - if you 
have a collection of serialised sparse matrices, for example, which 
consist of identically-sized blocks of zeroes with a scattering of ones 
throughout, then lengths and prefixes will be useless, whereas hashes will 
work perfectly. However, here, we're looking at MP3s, where lengths and 
prefixes will be a win.

tom

[1] The distribution of those is a bit weird: ten pairs consist of two 
tracks from The Conet Project's 'Recordings of Shortwave Numbers 
Stations', one pairs a track from that with The Doors' 'Horse Latitudes', 
and one is between two Calexico songs ('The Ride (Pt II)' and 'Minas De 
Cobre'). Why on earth do eleven of the twelve pairs consist of songs from 
the same artist? Is it really that they're pairs of songs from the same 
compressor (those tracks aren't from CD), I wonder?

-- 
Not all legislation can be eye-catching, and it is important that the
desire to achieve the headlines does not mean that small but useful
measures are crowded out of the legislative programme. -- Select Committee
on Transport


