Compression of random binary data

Steve D'Aprano steve+python at pearwood.info
Mon Oct 23 01:39:35 EDT 2017


On Mon, 23 Oct 2017 08:22 am, danceswithnumbers at gmail.com wrote:

> In Short, I cannot find a single mathematical proof that says you cannot
> compress random numbers. 

Three seconds of googling finds this:

http://www.faqs.org/faqs/compression-faq/part1/section-8.html

which demonstrates a simple, trivial proof.

It took me a little longer to find this:

https://en.wikipedia.org/wiki/Kolmogorov_complexity#Compression

but only because I couldn't remember how to spell Kolmogorov, and that lead me
to this:

https://en.wikipedia.org/wiki/Incompressible_string

for a concrete example.

Of course *occasionally* one will, by pure luck, compress a stream of random
bits. I run my random number generator and by a fluke it generates a stream
of 50 zeroes in a row:

00000000000000000000000000000000000000000000000000

which even a simple run-length encoding scheme can compress:

(50*0)


But flukes are not what people mean when they talk about compressing random
data. They mean some algorithm which can:


- transparently (no hiding data in the decompressor, or the file name, 
  or in a secret file hidden on the disk, or anywhere else)

- reversibly (there must be a decompressor which when given ONLY the
  compressed file, and NOTHING else, reconstruct the original)

- losslessly (no throwing away data: we must reconstruct the EXACT
  original, not merely something "close enough")

- and *consistently* (flukes don't count: every input must be compressed)


compress random data of size N bits to at most N-1 bits.

We can prove that this is impossible by simply *counting* how many random
strings there are.

Let us pick a concrete example, N=8. Then there are 2**8 = 256 possible
strings. We need to compress down to at most 7 bits. If we allow the output
to vary in length, there are:

2**7 = 128 seven bit strings;
2**6 = 64 six bit strings;
2**5 = 32 five bit strings;
2**4 = 16 four bit strings;
2**3 = 8 three bit strings;
2**2 = 4 two bit strings;
2**1 = 2 one bit strings;
2**0 = 1 zero bit strings

which add up to 255 compressed strings in total. But there are 256 possible
inputs, and only 255 possible outputs: so at least one input cannot be
compressed, or the compression must be lossy (two inputs compress to the same
output).




By the way: here is a very clever trick for hiding information in the file
system:

http://www.patrickcraig.co.uk/other/compression.php


but as people point out, the information in the file, plus the information in
the file system, ends up being *larger* than the original. Well done to
Patrick Craig for finding a clever loophole to "win" the challenge, but he
did so without actually compressing the original data.




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.




More information about the Python-list mailing list