Why checksum? [was Re: Fuzzy Lookups]
Tom Anderson
twic at urchin.earth.li
Thu Feb 2 07:10:01 EST 2006
On Thu, 1 Feb 2006, it was written:
> Tom Anderson <twic at urchin.earth.li> writes:
>
>>> The obvious way is make a list of hashes, and sort the list.
>>
>> Obvious, perhaps; prudent, no. To make the list of hashes, you have to
>> read all of every single file first, which could take a while. If your
>> files are reasonably random at the beginning, ...
>
> The possibility of two different mp3 files having the same id3 tags is
> something you might specifically be checking for.
So read from the end of the file, rather than the beginning.
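Something like this would do it (an untested sketch; the name tail_bytes
and the 512-byte default are just my guesses at sensible values):

def tail_bytes(path, n=512):
    # grab up to the last n bytes by seeking in from the end of the file
    f = open(path, 'rb')
    try:
        f.seek(0, 2)                    # whence=2 means 'relative to end'
        size = f.tell()
        f.seek(max(0, size - n))
        return f.read()
    finally:
        f.close()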
>> Better yet, note that if two files are identical, they must have the
>> same length, and that finding the length can be done very cheaply, so a
>> quicker yet approach is to make a list of lengths, sort that, and look
>> for duplicates; when you find one, do a byte-by-byte comparison of the
>> files (probably terminating in the first block) to see if they really
>> are the same.
>
> Yes, checking the file lengths first is an obvious heuristic, but if you
> find you have a bunch of files with the same length, what do you do?
> You're back to a list of hashes.
Or prefixes or suffixes.
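Putting the whole thing together, i'm imagining something like this
(untested sketch; find_duplicates and same_contents are names i've made
up, and the 4096-byte chunk size is arbitrary):

import os

def find_duplicates(paths, chunk=4096):
    # group by size first - stat is cheap, and no file contents are read yet
    by_size = {}
    for p in paths:
        by_size.setdefault(os.path.getsize(p), []).append(p)
    dups = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        # only files of identical size get compared byte-by-byte
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if same_contents(group[i], group[j], chunk):
                    dups.append((group[i], group[j]))
    return dups

def same_contents(a, b, chunk=4096):
    # compare a chunk at a time, so files that differ early are
    # rejected after reading only the first block or two
    fa, fb = open(a, 'rb'), open(b, 'rb')
    try:
        while True:
            ba, bb = fa.read(chunk), fb.read(chunk)
            if ba != bb:
                return False
            if not ba:              # both hit EOF together: identical
                return True
    finally:
        fa.close()
        fb.close()

The stat calls are the only work done on every file; contents are only
read once two files have already matched on length, and the comparison
stops at the first differing chunk.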
>> By way of example, of the 2690 music files in my iTunes library, i have
>> twelve pairs of same-sized files [1], and all of these differ within
>> the first 18 bytes (mostly, within the first 9 bytes).
>
> That's a small enough set of matches that you don't need a general
> purpose algorithm.
True - and this is *exactly* the situation that the OP was talking about,
so this algorithm is appropriate. Moreover, i believe it is representative of
most situations where you have a bunch of files to compare. Of course,
cases where files are tougher to tell apart do exist, but i think they're
corner cases. Could you suggest a common kind of file with degenerate
lengths, prefixes and suffixes?
The only one that springs to mind is a set of same-sized image files in
some noncompressed format, recording similar images (frames in a movie,
say), where the differences might be buried deep in the pixel data. As it
happens, i have just such a dataset on disk: with the images in TIFF
format, i get differences between subsequent frames after 9 bytes, but i
suspect that's a timestamp or something; if i convert everything to a nice
simple BMP (throwing away 8 bits per sample of precision in the process -
probably turning most of the pixels to 0!), then i find differences about
a megabyte in. If i compare from the tail in, i also have to wade through
about a megabyte before finding a difference. Here, hashes would be ideal.
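For that sort of dataset, something along these lines would do (again
untested; hash_file is just my name for it, and md5 via hashlib is one
reasonable choice of digest):

import hashlib

def hash_file(path, chunk=1024 * 1024):
    # read in chunks so a big image file never has to fit in memory at once
    h = hashlib.md5()
    f = open(path, 'rb')
    try:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    finally:
        f.close()
    return h.hexdigest()

Sort the digests (or stick them in a dict) and any collision is a
duplicate, give or take the vanishingly small chance of two different
files hashing the same.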
tom
--
The revolution is here. Get against the wall, sunshine. -- Mike Froggatt