Compression of random binary data

Steven D'Aprano steve at pearwood.info
Tue Jul 12 22:29:35 EDT 2016


On Wed, 13 Jul 2016 03:35 am, jonas.thornvall at gmail.com wrote:

> No it is only compressible down to a limit given by the algorithm.

Right! Then there is data that you can't compress.

Suppose you have some data:

data = "ABABABABABAB...ABAB"

And you compress it "down to a limit":

x = compress(compress(compress(data)))
print(x)
=> prints "@nx7%k!b"

Now let's try again with something else:

data = "AABBBCCCCDDDDEEEE...ZZZZ"

And you compress it "down to a limit":

x = compress(compress(compress(compress(data))))
print(x)
=> prints "wu*$cS#k-pv32zx[&+r"


One more time:

data = "AABBAABBAABBAABBAABB"
x = compress(data)
print(x)
=> prints "g^x3@"


We agree on this. Now you say, "Give me some random data, anything at all,
and I'll compress it!", and I run a random number generator and out pops:

data = "@nx7%k!b"

or possibly:

data = "wu*$cS#k-pv32zx[&+r"

or:

data = "g^x3@"


and I say "Compress that!"

But we've already agreed that this is as compressed as you can possibly make
it. You can't compress it any more.

So there's *at least some* random data that you can't compress. Surely you
have to accept that. You don't get to say "Oh, I don't mean *that* data, I
mean only data that I can compress". Random data means it's random: you
don't get to pick and choose between data you can compress and data that
you can't.

Now the tricky part is to realise that it's not just short sequences of
random data that can't be compressed. The same applies to LONG sequences
too. If I give you a gigabyte of raw video, you can probably compress that a
fair bit. That's what things like x264 encoders do. The x265 encoder is
even better. But they're lossy, so you can't reverse them exactly.

But if I give you a gigabyte of random data, you'll be lucky to find *any*
patterns or redundancies that allow compression. You might be able to
shrink the file by a few KB. And if you take that already compressed file,
and try to compress it again, well, you've already hit the limit of
compression. There's no more redundancy left to remove.
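You can try this yourself with zlib. The sketch below uses seeded pseudo-random bytes so the run is repeatable (the seed value is arbitrary); even at maximum effort, the "compressed" output comes out essentially the same size as the input:

```python
import random
import zlib

# A megabyte of pseudo-random bytes (seeded for repeatability).
rng = random.Random(42)
data = bytes(rng.randrange(256) for _ in range(1_000_000))

packed = zlib.compress(data, level=9)

# For random input the output is within a fraction of a percent of the
# original size -- usually slightly LARGER, due to format overhead.
print(len(data), len(packed))
```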

It doesn't matter how clever you are, or what a "folding structure" is, or
how many times you iterate over the data. It's a matter of absolute
simplicity: the pigeonhole principle. You can't argue with the numbers.

If you start with a 100 digit decimal number, there are 10**100 different
pigeons. If you can compress down to a 6 digit decimal number, there are
10**6 pigeon holes. You cannot put 10**100 pigeons into 10**6 pigeon holes
without doubling up (which makes your compression lossy).
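The pigeonhole count is easy to check directly. For any length n there are 2**n possible n-bit strings, but only 2**n - 1 strings that are strictly shorter, so a lossless compressor must leave at least one n-bit string uncompressed. (n = 8 below is just an illustrative choice.)

```python
# Pigeonhole count: distinct inputs vs. distinct shorter outputs.
n = 8  # bit length; any n gives the same shortfall

inputs = 2 ** n                                  # all n-bit strings
shorter_outputs = sum(2 ** k for k in range(n))  # all strings of 0..n-1 bits

print(inputs, shorter_outputs)   # → 256 255: one pigeon too many
assert shorter_outputs == inputs - 1
```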

So either some numbers cannot be compressed, or some numbers are compressed
to the same result, and you can't tell which was the original. That's your
choice: a lossless encoder means some numbers can't be compressed, a lossy
encoder means you can't reverse the process exactly.





-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



