Compression of random binary data

Steve D'Aprano steve+python at pearwood.info
Tue Oct 24 07:44:57 EDT 2017


On Tue, 24 Oct 2017 09:23 pm, Ben Bacarisse wrote:

> Forget random data.  For one thing it's hard to define, 

That bit is true.

> but more importantly no one cares about it.

But that's wrong.

For instance:

- Encrypted data looks very much like random noise. With more and more data
  traversing the internet in encrypted form, the ability to compress random
  noise would be worth billions.

- Compressed data looks somewhat like random noise (with a bit of structure).
  The more it is compressed, the more random it looks. If you could compress
  random noise, you could take already compressed data, and compress it again,
  saving even more space.

- Many multimedia formats (images, sound, video) are compressed using
  dedicated encoders. The better the encoder and the more it compresses the
  data (whether lossily or not), the harder it is to compress the result
  further. If you could compress random noise, you could compress JPEGs, MP3s,
  H.265-encoded MKVs, etc, saving even more storage and transmission costs.

And most importantly:

- Random data is a superset of the arbitrary structured data you mention
  below. If we could compress random data, then we could compress any data
  at all, no matter how much or little structure it contained.
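The first two points are easy to check for yourself. Here is a minimal sketch using Python's standard zlib module: random bytes (a stand-in for encrypted or already-compressed data) do not shrink under a general-purpose compressor -- the output is typically slightly *larger* than the input, because of framing overhead.

```python
import os
import zlib

# Random bytes stand in for encrypted or already-compressed data.
data = os.urandom(100_000)
compressed = zlib.compress(data, level=9)

# Incompressible input gets stored, not compressed, and zlib's
# framing adds a few bytes of overhead on top.
print(len(data), len(compressed))
```

Run it a few times if you like; the seed changes but the outcome never does.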

This is why the ability to compress random data (if it were possible, which it
is not) is interesting. It's not because people want to be able to compress
last night's lottery numbers, or tables of random digits.


> By its very nature, random data is 
> not interesting.  What people want is a reversible compression algorithm
> that works on *arbitrary data* -- i.e. on *any* file at all, no matter
> how structured and *non-random* it is.

In a sense you are right. Compressing randomly generated data would be a
parlour trick, and not especially useful in itself. But if you had such an
algorithm, it would change the face of the world.

It would be as revolutionary and paradigm breaking as a perpetual motion
machine, or discovery of a new continent the size of China in the middle of
the Atlantic, or that π actually does equal 22/7 exactly.

And just as impossible.


> For example, run the complete works of Shakespeare through your program.
> The result is very much not random data, but that's the sort of data
> people want to compress.  If you can compress the output of your
> compressor you have made a good start.  Of course what you really want
> to be able to do is to compress the output that results from compressing
> your compressed output.  And, of course, you should not stop there.  Since 
> you can compress *any* data (not just the boring random stuff) you can
> keep going -- compressing the compressed output again and again until
> you end up with a zero-length file.

Indeed.

That proof by contradiction is yet another reason we know we can't compress
random data -- that is to say, *arbitrary* data. If we had a compression
program which could guarantee to ALWAYS shrink ANY file by at least one bit,
then we could just apply it over and over again, shrinking the compressed
file again and again, until we were left with a zero-byte file:

    original.dat = 110 MB
    original.dat.zip.zip.zip.zip.zip.zip.zip = 0 MB

And then reverse the process, to turn an empty file back into the original.

But given an empty file, how do you distinguish the empty file you get
from 'music.mp3' and the identical empty file you get from 'movie.avi'?

Obviously you cannot. So we know that the only way to *guarantee* to shrink
every possible file is if the compression is lossy.
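The counting behind that conclusion can be made concrete. There are 2**n distinct files of exactly n bits, but only 2**n - 1 files that are strictly shorter (summing over every length from 0 to n-1), so any scheme that shortens *every* n-bit file must send two different inputs to the same output:

```python
# Pigeonhole argument: count the bit strings of each length.
n = 8

n_bit_files = 2 ** n                           # files of exactly n bits
shorter_files = sum(2 ** k for k in range(n))  # files of 0..n-1 bits

print(n_bit_files, shorter_files)  # 256 vs 255

# One fewer pigeonhole than pigeons: a compressor that shortens every
# n-bit file must collide two inputs, so it cannot be losslessly
# reversed.
assert shorter_files == n_bit_files - 1
```

The same arithmetic holds for every n, which is why no amount of cleverness gets around it.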


> Then you publish in a major journal.  Post the link to the journal
> article when you are done.

These days there are plenty of predatory journals which will be happy to take
Dancerswithnumber's money in return for publishing it in a junk journal.

https://en.wikipedia.org/wiki/Predatory_open_access_publishing



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



