Compression of random binary data

Gregory Ewing greg.ewing at canterbury.ac.nz
Tue Oct 24 03:24:30 EDT 2017


danceswithnumbers at gmail.com wrote:

> Compress this:
> 
> 4135124325
> 
> Bin to dec...still very large
> 11110110
> 01111000
> 11111101
> 01100101

Wait right there! You're cheating by dropping off leading
0 bits. The maximum value of a 10 digit decimal number is
9999999999, which in hex is 2540be3ff. That's 34 bits.
That's in line with the theoretical number of bits needed:

log2(10) * 10 = 33.219

So the binary version of your number above is really

         00
   11110110
   01111000
   11111101
   01100101

You may think you can get away without storing or
transmitting those leading 0 bits, because the decoder
can always pad out the data as needed.

But to do that, the decoder needs to know *how many*
bits to pad out to. That information somehow needs to
be part of the encoded data.

You need to imagine you're sending the data to someone
over a wire. The only thing you can send along the wire
are ones and zeroes. You can't convey any information
by timing, or turning off the power, or anything like
that. How is the receiver going to know when he's got
the whole message?

There are only two possibilities. Either you decide
in advance that all messages will be exactly the same
length, which in this case means always sending
exactly 34 bits. Or you add some extra bits to the
message -- prepend the length in binary, or add an
end-of-message code at the end, or something like
that.

Whatever you do, you'll find that *on average* you
will need *at least* 34 bits to be able to represent
all possible 10-digit decimal numbers. Some might
be shorter, but then others will be longer, and
the average won't be less than 34.

> New compression method:
> 
> 11000101
> 11000111
> 01001111
> 
> A full byte less than bin.

You need to be *very* careful about what you're claiming here.
Are you saying that your algorithm compresses *all possible*
sequences of 10 decimal digits to 3 bytes or less? Or can
some of them come out longer?

-- 
Greg



More information about the Python-list mailing list