Shorter checksum than MD5

Sat Sep 11 01:17:47 EDT 2004

[Paul Rubin]
>> ...
>> Also, with CRC32, it's very easy to create a record on purpose that has any
>> given checksum.  Is THAT a problem? ...

[Elbert Lev]
> Really?

Yup.

> If this is "very easy" please modify the string to have the same CRC32:

I will, but I'm going to lecture first <wink>.  CRC algorithms were
designed to catch common (especially in older days) kinds of
transmission corruption, not to resist attacks.  CRC wants to guard
against things like single-bit inversions, double-bit inversions, and
bursts of zero or one bits.  There are physical causes for those kinds
of corruption.

Now suppose your messages were big integers instead of strings.  A
decent checksum algorithm would be to send, along with your big
integer N, its remainder modulo a fixed prime P.  For concreteness,
say P is 1003.  Then

def pseudocrc(N):
    return N % 1003

That will catch many kinds of corruption.  But it's dead easy to fool
if you want to!  For any given "message" N, the set of messages with
the same pseudocrc is exactly the set of all integers of the form
N+P*i for some integer i.  Given N, it's not only easy to find *a*
distinct message with the same pseudocrc, it's easy to find *all*
messages with the same pseudocrc.

Now "real" CRCs have essentially the same story!  They use a different
form of arithmetic ("binary polynomials", informally speaking), but
it's the same thing:  within that arithmetic, the CRC of a string is
just the remainder of the string (viewed as a polynomial) modulo a
fixed polynomial P.

Real CRC algorithms are more complicated than just that, playing goofy
but shallow tricks with bit inversions and byte reversals, and playing
seriously cool tricks with optimization.  That can make it hard to
read a chunk of CRC code and realize what it's doing.  Still, at
heart, it's just computing a remainder.

To show a string with the same CRC as yours above, we need some
utility functions first:

import binascii

# View a string as a giant base-256 integer, and return the latter.
def str2long(s):
    h = binascii.hexlify(s)
    return long(h, 16)

# View a long as a base-256 integer, and return the string.
# This is str2long's inverse.
def long2str(n):
    h = hex(n)
    assert h.startswith('0x')
    h = h[2:]
    if h.endswith('L'):
        h = h[:-1]
    if len(h) & 1:
        h = '0' + h
    return binascii.unhexlify(h)

# Compute binascii.crc32 on string s with given salt,
# and return as a hex string.
def crc(s, salt):
    return hex(binascii.crc32(s, salt))

If you don't like binascii.crc32 (there are many CRC polynomials in
use), you'll have to figure out a different P.  For binascii.crc32,
the magic value is

P = 0x196300777

Now a driver function:

# Compute the crc (with given salt) on s, and on s again
# after xor'ing toxor into it.
def checkit(s, toxor, salt=0):
    t = long2str(str2long(s) ^ toxor)
    print repr(s)
    print repr(t)
    print crc(s, salt), crc(t, salt)

and your string:

s = "The probability of the changed dada having the same CRC32  is 1/2**32"

Let's try it:

>>> checkit(s, 0)
'The probability of the changed dada having the same CRC32  is 1/2**32'
'The probability of the changed dada having the same CRC32  is 1/2**32'
0x5a91f29 0x5a91f29

That's not amazing, because xor'ing 0 into your string didn't change
it <wink>.  But xor'ing P into s doesn't change the crc either:

>>> checkit(s, P)
'The probability of the changed dada having the same CRC32  is 1/2**32'
'The probability of the changed dada having the same CRC32  is 1/3\xbc\x1a4E'
0x5a91f29 0x5a91f29

More, xor'ing in "multiples" of P also don't change the crc.  The
underlying arithmetic uses xor instead of addition, and for obscure
technical reasons (having to do with details of the binascii.crc32
implementation) we have to restrict shift counts to multiples of 8. 
So I'll pick one out of my hat:

>>> checkit(s, P<<304 ^ P<<296 ^ P<<128 ^ P<<64)
'The probability of the changed dada having the same CRC32  is 1/2**32'
"The probability of the chao\xf0\xc3SP\x13ada having the
s`\xfbU'4RC33\xb6\x10n\x04 1/2**32"
0x5a91f29 0x5a91f29

And so on -- we can systematically generate an unbounded number of
distinct strings with the same crc as your string.

> ...
> To make the task even easier, I will not give you the "salt" value :)

Surprise again:  that doesn't make any difference.  The pairs of
strings above have the same crc for all salt values -- adding a salt
doesn't do any more good for CRCs than adding this to our pseudocrc
function:

def pseudocrc(N, salt):
    return (N + salt) % 1003

That is, it just shifts the result by a constant, and so has no effect
on which pairs of integers map to the same checksum.  One concrete
example:

>>> checkit(s, P, 0x12345678)
'The probability of the changed dada having the same CRC32  is 1/2**32'
'The probability of the changed dada having the same CRC32  is 1/3\xbc\x1a4E'
0x75a90579 0x75a90579

The real purpose of a CRC salt (it doesn't, and can't, improve
"security") is to let you compute a CRC of a very large blob of data
in chunks, feeding the result of one chunk to the next crc(chunk) call
as "the salt".