Shorter checksum than MD5

Jeff Epler jepler at unpythonic.net
Thu Sep 9 09:56:33 EDT 2004


If you have a good cryptographic hash function then picking any N bits
from it should be just as good as any other N bits.

sha and md5 were designed with cryptographic needs in mind, while CRC
wasn't (it's designed to detect spurious bit errors in noisy transmission
media, which tend to have particular properties), but collisions have
now been found in md5 which means most people would not recommend its
use in new software.  I would choose SHA at this point, even though new
research implies it may be weak too. (an attack on a simplified version
of SHA can produce collisions)  On the other hand, if your database
replication app doesn't have a threat model where an attacker would want
to deliberately make the two sites out of sync, a cryptographically weak
hash might still be acceptable.

For your application, you should consider the total number of records
you ever expect to have, and use more than 2 * lg(records) bits of hash.
Due to the so-called "birthday paradox", when you have N possible hash
values, two will be identical with 50% probability with around sqrt(N)
items.  You'd probably prefer that the probability be much lower in your
application, since a collision will result in incorrect results.

You might consider some way to group records together, so that it's not
a hash per 70-byte record, but a hash per N 70-byte records.  Or you
might skip this approach and implement transactions which can be stored
and played to the second server when they sync up.

Jeff
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 196 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20040909/1111b419/attachment.sig>


More information about the Python-list mailing list