Compression of random binary data

Steven D'Aprano steve at pearwood.info
Tue Jul 12 11:11:39 EDT 2016


On Wed, 13 Jul 2016 12:24 am, jonas.thornvall at gmail.com wrote:

> Den måndag 11 juli 2016 kl. 20:38:51 UTC+2 skrev Steven D'Aprano:
>> On Tue, 12 Jul 2016 03:52 am, jonas.thornvall at gmail.com wrote:
>> 
>> > What kind of statistic law or mathematical conjecture  or is it even a
>> > physical law is violated by compression of random binary data?
>> 
>> The pigeon hole principle. If you have 100 pigeon holes, and 101 pigeons,
>> then clearly at least one pigeon hole must have two pigeons in it.
[...]
> But it seem your reasoning is based upon interpretation of the actual
> digits, bits and bytes value.

Not at all. If you think that, you've misread my example. There's no
interpretation of the bytes: they are just 8-bit numbers from 0 to 255. You
cannot losslessly compress all 256 of them to just four 2-bit numbers.


> There could be different interpretation 
> worlds of course you would have to chose one using digits, An 
> interpretationworld here could be reading out different word lengths of
> the dataset and maybe a lookup table.

Any lookup table you have counts as part of the compressed data.


> But it could also be arithmetic rules that magically recreate a number
> from a number of folds or difference of folds.

Oh, sure, if you believe in magic, anything is possible. Just close your
eyes, click your heels together, and wish really, really hard.

Suppose I could compress ANY random data, no matter what, down to 10% of the
original size. Okay, let's start with a million bits of data. Compress it
down to 100,000 bits.

But I believe that I can compress *anything*, any random collection of data.
Okay, let me compress it again. Now I have 10,000 bits.

Compress it again. Now I have 1,000 bits.

Compress it again. Now I have 100 bits.

Compress it again. Now I have 10 bits.

Compress it again. Now I have 1 bit, either a 0 or a 1.


Can you not see how absurd this is? I have claimed that I can take *any*
random set of data, and by compressing it again and again and again,
compress it down to ONE BIT, either a 0 or a 1, WITHOUT LOSS. Somehow I
have to take that 0 bit and uncompress it back to the Complete Works Of
William Shakespeare, and *also* uncompress it back to the recent Deadpool
movie, AND uncompress it back to last year's Ant Man movie, AND uncompress
it back to some funny picture of a cat.

How can I possibly know which of the billions and billions of different
files this 0 bit represents?

If you pass me a 0 bit, and say "uncompress this", and I get The Lord Of The
Rings novels, and then you pass me another 0 bit, and I uncompress it and
get The Hobbit, well, how did I tell the two bits apart? They're both zero.



The alternative is to say, it doesn't matter how clever you are, you can't
compress *everything*. There are some things that simply won't compress.
Eventually you get something no longer compresses. If you could compress
EVERYTHING, then you could compress the compressed data, and compress the
compressed-compressed data, and so on, until you've got only a single bit.
And that is ridiculous.



-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.




More information about the Python-list mailing list