Shorter checksum than MD5

Paul Rubin http
Thu Sep 9 06:59:57 EDT 2004


Mercuro <this at is.invalid> writes:
> i'm looking for a simple way to checksum my data. The data is 70 bytes
> long per record, so a 32 byte hex md5sum would increase the size of my
> mysql db a lot.

If you store the digest as raw binary rather than as a hex string, the
md5 checksum is 16 bytes, not 32.
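
To see the size difference, here is a minimal sketch using the standard
hashlib module. The record contents are made up for illustration.

```python
import hashlib

record = b"x" * 70  # hypothetical 70-byte record

raw = hashlib.md5(record).digest()        # raw binary digest
hexed = hashlib.md5(record).hexdigest()   # same digest as a hex string

print(len(raw), len(hexed))
```

The raw digest could go in a fixed-width binary column instead of a
32-character text column.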

> I'm looking for something that is 5 bytes long, for the moment i'm
> just taking a part of the hex md5 sum (like this: checksum =
> md5sum[3:8]).  I don't have any duplicates, and I have over 100000
> records, but i'm not sure for the future...

Using 5 hex digits would give you just 20 bits of hash (about a million
possible values), so you would almost certainly get collisions with
that many records.
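
A rough way to check this is the usual birthday approximation. The
sketch below assumes the hash values are uniformly distributed; the
function name is my own, not something from a library.

```python
import math

def collision_probability(n_records, hash_bits):
    """Approximate chance that at least two of n_records share a hash
    value, assuming uniformly distributed hashes (birthday bound)."""
    buckets = 2 ** hash_bits
    expected_pairs = n_records * (n_records - 1) / (2 * buckets)
    # -expm1(-x) computes 1 - exp(-x) without losing tiny values
    return -math.expm1(-expected_pairs)

p20 = collision_probability(100000, 20)    # 5 hex digits
p40 = collision_probability(100000, 40)    # 5 raw bytes
p128 = collision_probability(100000, 128)  # full md5

print(p20, p40, p128)
```

With 100000 records, 20 bits gives a probability that is effectively 1,
while 5 full bytes (40 bits) gives a bit under half a percent, growing
roughly with the square of the record count.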

> PS: I use this checksum to periodically compare 2 versions of this DB,
> which are on 2 sides of a slow internet connection.  My hope is to
> keep down unneeded traffic between the 2 servers.

How about putting a timestamp in each record, so you only have to
compare the records that have been updated since the last periodic
comparison?
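
A minimal sketch of the timestamp idea, with records held as plain
dicts. The field names "id" and "updated" and the helper name are
assumptions for illustration; in the real database this would more
likely be a query on an indexed timestamp column.

```python
from datetime import datetime

def changed_since(records, last_sync):
    """Return the records whose timestamp is later than last_sync."""
    return [r for r in records if r["updated"] > last_sync]

# Hypothetical sample data
records = [
    {"id": 1, "updated": datetime(2004, 9, 1, 12, 0)},
    {"id": 2, "updated": datetime(2004, 9, 8, 9, 30)},
]
last_sync = datetime(2004, 9, 5, 0, 0)

changed_ids = [r["id"] for r in changed_since(records, last_sync)]
print(changed_ids)
```

Only the records returned here would need to cross the slow link.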

Or, if you expect only occasional changes, you could compare hashes of
long runs of records, then narrow down the comparisons to locate the
records that actually differ.  You could straightforwardly put a tree
structure over the hashes, but maybe there's some even better way.
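
One possible sketch of the run-comparison idea: hash a whole run on
both sides, and split the run in half only when the hashes disagree.
It assumes both sides hold the same number of fixed-length records in
the same order, and it keeps both copies in one process for
illustration. In practice each range_hash on the remote side would be
one small request over the link. The function names are my own.

```python
import hashlib

def range_hash(records, lo, hi):
    """md5 digest of the run records[lo:hi], computed locally."""
    h = hashlib.md5()
    for rec in records[lo:hi]:
        h.update(rec)
    return h.digest()

def find_differences(local, remote, lo=0, hi=None):
    """Return the indices at which local and remote differ, descending
    only into runs whose hashes do not match."""
    if hi is None:
        hi = len(local)
    if lo >= hi or range_hash(local, lo, hi) == range_hash(remote, lo, hi):
        return []
    if hi - lo == 1:
        return [lo]
    mid = (lo + hi) // 2
    return (find_differences(local, remote, lo, mid)
            + find_differences(local, remote, mid, hi))

# Hypothetical data: 1000 records of 70 bytes, one of them changed
local = [("record %d" % i).encode().ljust(70) for i in range(1000)]
remote = list(local)
remote[137] = b"changed".ljust(70)

print(find_differences(local, remote))
```

When only a few records differ, the number of hashes exchanged grows
with the logarithm of the table size per changed record, rather than
with the table size itself.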



More information about the Python-list mailing list