Compression of random binary data

Steve D'Aprano steve+python at pearwood.info
Mon Oct 23 00:21:33 EDT 2017


On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:

> ram at zedat.fu-berlin.de (Stefan Ram) writes:
>>When we have a source of 2 random binary digits,
>>there are 2^2 possible outcomes:
>>( 0, 0 ),
>>( 0, 1 ),
>>( 1, 0 ), and
>>( 1, 1 ).
>>. One cannot agree upon a code to represent each
>>of those four possibilities with just one bit.
> 
>   If the probability of each outcome is 0.25.

When people talk about "random data", they normally mean uniformly distributed,
independent values, i.e. equal probabilities for every outcome. Likewise, when they
talk about compression in this context, they mean lossless, reversible compression.

If the probabilities of certain codes (either single codes, or sequences of
codes) are unequal, then you can take advantage of that by encoding the
common cases with a short representation and the uncommon or rare cases
with a longer one. As you say:


>   Otherwise, if ( 0, 0 ) is much more frequent,
>   we can encode ( 0, 0 ) by "0" and
> 
> ( 0, 1 ) by "101",
> ( 1, 0 ) by "110", and
> ( 1, 1 ) by "111".
> 
>   And we could then use /less/ than two bits on the
>   average. 

That only holds if (0,0) really is much more frequent. If all four pairs are
equally likely, this code actually costs 2.5 bits on average:

(1*1 bit + 3*3 bits, divided by four equally likely outcomes, makes 2.5 bits.)

Where you can actually get compression is if you limit yourself to encoding
specially selected input data, where you have many more (0,0) pairs than the
other combinations. But such data is, by definition, highly non-random.
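
Here is a quick sketch of that arithmetic in Python (the 85/5/5/5 split for
the skewed case is just an arbitrary example of mine, not anything from this
thread):

code_lengths = {(0, 0): 1, (0, 1): 3, (1, 0): 3, (1, 1): 3}

def expected_bits(probs):
    # Average output bits per two-bit input pair, weighted by probability.
    return sum(probs[pair] * code_lengths[pair] for pair in code_lengths)

uniform = {pair: 0.25 for pair in code_lengths}
skewed = {(0, 0): 0.85, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.05}

print(expected_bits(uniform))  # 2.5 bits per pair -- expansion
print(expected_bits(skewed))   # 1.3 bits per pair -- genuine compression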

For example, if we are compressing a black and white bitmap, say a fax, the
most common data is a page which is mostly white with only a bit of black
text. That gives us a bitmap with lots and lots of (0,0) pairs and relatively
few of the other combinations, which means for the common case of "a page of
black text on a white background" the compression works well.

(I'm treating 0 as white and 1 as black.)

But for purely random input there will be roughly equal numbers of each pair
of bits, and on average the encoding will expand the data by 25% (two bits of
input map to 2.5 bits of output, on average).
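
If you want to see that happen, here is a rough simulation of both cases
(the 1% black pixel rate for the "fax" and the helper names are my own
invented details, chosen only for illustration):

import random

CODE = {(0, 0): "0", (0, 1): "101", (1, 0): "110", (1, 1): "111"}

def encode(bits):
    # Encode an even-length list of bits, two at a time.
    return "".join(CODE[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2))

random.seed(0)
mostly_white = [1 if random.random() < 0.01 else 0 for _ in range(10_000)]
purely_random = [random.randrange(2) for _ in range(10_000)]

print(len(encode(mostly_white)) / len(mostly_white))    # well under 1.0: compressed
print(len(encode(purely_random)) / len(purely_random))  # about 1.25: expanded by ~25%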



>   But two bits are (maximally) random if 
>   the probability of every combination /is/ 0.25.

In the example you give, we have four possible pairs of bits:


Input:  (0,0) (0,1) (1,0) (1,1)  # 2 bits on average
Output: "0"   "101" "110" "111"  # 2.5 bits on average


So if the incoming data is random (uniformly-distributed), this will expand
the size of the data by 25%.

Only if the data is non-random and biased heavily towards (0,0) will you see
compression savings.

Of course in the real world most data we want to compress is non-random, and
we can get really impressive compression ratios. But the cost of that is that
you can rarely compress data twice (if you do, you either get very little
extra saving the second time, or the data expands), and you cannot compress
random (uniformly distributed) data.
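
A rough demonstration using the standard library's zlib (the repeated-sentence
input and the 22,000-odd byte sizes are arbitrary choices of mine):

import os
import zlib

text = ("the quick brown fox jumps over the lazy dog " * 500).encode()
once = zlib.compress(text)
twice = zlib.compress(once)

# Highly repetitive text compresses dramatically; compressing the result
# again gains nothing (it actually grows slightly).
print(len(text), len(once), len(twice))

# Uniformly random bytes don't compress at all; the output comes out a
# little bigger than the input.
random_bytes = os.urandom(len(text))
print(len(random_bytes), len(zlib.compress(random_bytes)))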

>   (If other combinations than ( 0, 0 ) were extremely
>   rare, one could even improve the compression more.)

What you say is true, but it has nothing to do with Danceswithnumbers'
ignorant and ludicrous claim that he can compress random data and that there
is no mathematical proof that you cannot.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



