Compression of random binary data

Steven D'Aprano steve at pearwood.info
Mon Jul 11 14:38:27 EDT 2016


On Tue, 12 Jul 2016 03:52 am, jonas.thornvall at gmail.com wrote:

> What kind of statistical law, mathematical conjecture, or even physical
> law is violated by compression of random binary data?

The pigeonhole principle. If you have 100 pigeonholes, and 101 pigeons,
then clearly at least one pigeonhole must have two pigeons in it.

To keep the numbers small and manageable, let's say we are going to compress
one byte at a time. Now a byte has eight bits, so there are exactly 256
possible bytes:

0000 0000
0000 0001
0000 0010
...
1111 1110
1111 1111

Now, suppose I claim that I can LOSSLESSLY (that is, reversibly) compress
any random byte to just two bits. The lossless part is important: it's not
hard to compress random data by irreversibly throwing some of it away, and
there's no violation there.

So I claim that you can give me any random byte, and I will compress it to
just two bits:

00
01
10
11

and then be able to decompress it back again to give you the original byte
once more.
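A quick sanity check in Python (just the counting, nothing clever) shows
the mismatch:

>>> 2 ** 8  # distinct bytes
256
>>> 2 ** 2  # distinct two-bit codes
4

256 inputs, four outputs: the pigeons outnumber the holes 64 to one.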

Obviously I'm trying to pull a fast one! There's no way I can do this. I
can squeeze 256 pigeons into just four pigeonholes, but only by cramming
them in, 64 to a hole. Suppose I compress these three bytes to 00:

0000 0000
0110 1001
1100 0110

Now when I go to uncompress 00, what should I return? There is no way for me
to know which of the three was the original value.
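
If you want to see the collisions for yourself, here's a toy sketch (my
own throwaway example, not any real compressor) that "compresses" a byte
by keeping only its low two bits:

from collections import Counter

def compress(byte):
    # Keep only the low two bits: 8 bits in, 2 bits out.
    return byte & 0b11

counts = Counter(compress(b) for b in range(256))
print(counts)  # Counter({0: 64, 1: 64, 2: 64, 3: 64})

Every two-bit code stands in for 64 different bytes, so "uncompressing"
has no way to choose among them.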

(If I'm cunning, I'll have sneakily stored some data *elsewhere*, say, in
the file name, or in a database, so that you need this extra hidden data to
uncompress the 00 back to a full byte. But then I'm not compressing eight
bits down to two. I'm compressing eight bits down to two bits plus
who-knows-how-many-bits of hidden data.)

So the pigeonhole principle tells us one of two things:

(1) If you compress random data, then it must be lossy; I can compress eight
bits to two, but then I can't uncompress it back again, at least not
without throwing away some data.

(2) Or, if the compression is lossless, then some data must be expanded
rather than compressed. If you pick data at random, some of it will be
expanded.

Suppose I have a compression algorithm that infallibly and reversibly
compresses as follows:

0000 0000 <--> 00
0000 0001 <--> 01
0000 0010 <--> 10
0000 0011 <--> 11

That part is fine. But what will my algorithm do with the other 252 bytes?
At *best* it will leave them untouched:

0000 0100 <--> 0000 0100
...
1111 1111 <--> 1111 1111

which is no compression at all, but at worst it will actually expand them
and make them bigger. (After all, it's likely that my compression format
has at least a bit of overhead.)
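
To see how that overhead creeps in, here's a sketch of one honest,
reversible way to build such a scheme (my own toy code, purely for
illustration): spend one flag bit saying whether a short code or a full
byte follows. The four favoured bytes shrink to three bits each, but the
other 252 swell to nine:

def encode(byte):
    # Toy lossless code: '0' flag plus a 2-bit code for bytes 0-3,
    # '1' flag plus all 8 bits for everything else.
    if byte < 4:
        return '0' + format(byte, '02b')   # 3 bits
    return '1' + format(byte, '08b')       # 9 bits

def decode(bits):
    if bits[0] == '0':
        return int(bits[1:3], 2)
    return int(bits[1:9], 2)

assert all(decode(encode(b)) == b for b in range(256))  # reversible
print(sum(len(encode(b)) for b in range(256)) / 256)    # 8.90625

Averaged over all 256 bytes, that's 8.90625 bits per byte: reversibility
buys us expansion, not compression.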

In practice, compression algorithms are designed to look for particular
kinds of order or structure in the data, and compress that. That's fine for
the sorts of non-random data we care about: pictures are rarely pictures of
static, text files are rarely random collections of bits. But if you do
throw a random set of bits at a lossless compression algorithm, it will at
best not compress it at all, and at worst actually make the file bigger.
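
You can watch this happen with any real compressor. Here zlib is just a
convenient choice (gzip or bzip2 would tell the same story); the exact
sizes vary from run to run, but the shape of the result won't:

import os
import zlib

random_data = os.urandom(1000)    # 1000 random bytes
ordered_data = b'spam' * 250      # 1000 highly repetitive bytes

print(len(zlib.compress(random_data)))   # typically a bit over 1000
print(len(zlib.compress(ordered_data)))  # typically a couple of dozen

The repetitive data collapses to almost nothing; the random data comes
out slightly *larger* than it went in, thanks to the format's overhead.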


> What is to say that you cannot do it if the symbolic representation is
> richer than the symbolic representation of the dataset?
> 
> Isn't it a fact that the set of square roots actually depicts numbers in
> a shorter way than their actual representation?

Sure. But you need to know what √2 means. It *represents* the number
1.41421356237... but doesn't compress it. There's nothing you can do to the
symbol √2 that will uncompress back to the infinite series of digits. All
you can do is look it up somewhere to see what the digits are.

> Now the interpreter or program must know the rules. And I have very good
> rules to make it happen.

Right. How much information is in the rules? More than you save with
the "compression". Consider:

1.41421356237 compressed down to √2, that's 13 characters down to 2. Great!
But to *uncompress*, you need to store a rule:

√2=1.41421356237 

and that's *sixteen* characters. So your "compression" is:

original: 13
compressed to: 2
plus rule: 16

means you have "compressed" 13 characters to 18.
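
You can check the book-keeping directly (counting characters, not bytes;
in UTF-8 the √ alone takes three bytes, which only makes the deal worse):

>>> len("1.41421356237")     # the original digits
13
>>> len("√2")                # the "compressed" form
2
>>> len("√2=1.41421356237")  # the rule you need to reverse it
16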

Now, this is still worth doing if you need to repeat the √2 many times, so
long as you don't have to repeat the rule. That's useful. But it's not
compression. It's more like keeping an index to a database, or a scrap of
paper with the title of a book written on it:

"See Lord Of The Rings, by J.R.R. Tolkien"

That's a lot smaller than the actual book: eight words, instead of who knows
how many tens of thousands. But you can't call it compression: you can't
sit down with the scrap of paper *and nothing else* and uncompress it back
to the entire LOTR trilogy.




-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



