Compression of random binary data

Ben Bacarisse ben.usenet at bsb.me.uk
Tue Oct 24 09:29:02 EDT 2017


Steve D'Aprano <steve+python at pearwood.info> writes:

> On Tue, 24 Oct 2017 09:23 pm, Ben Bacarisse wrote:
>
>> Forget random data.  For one thing it's hard to define, 
>
> That bit is true.
>
>> but more importantly no one cares about it.
>
> But that's wrong.

All generalisations are false.  I was being hyperbolic.

> For instance:
>
> - Encrypted data looks very much like random noise. With more and more data
>   traversing the internet in encrypted form, the ability to compress random
>   noise would be worth billions.
>
> - Compressed data looks somewhat like random noise (with a bit of structure).
>   The more it is compressed, the more random it looks. If you could compress
>   random noise, you could take already compressed data, and compress it again,
>   saving even more space.
>
> - Many multimedia formats (images, sound, video) are compressed using
>   dedicated encoders. The better the encoder, the more it compresses the data
>   (whether lossy or not), and the harder it is to compress it further. If you
>   could compress random noise, you could compress JPGs, MP3s, h265-encoded
>   MKVs, etc., saving even more storage and transmission costs.

But these are not random data.  We care about these because they are
highly structured, non-random data.
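
A quick, illustrative zlib experiment (zlib standing in for any
general-purpose compressor; the exact numbers will vary from run to run)
shows the difference:

    import os
    import zlib

    # Compare a general-purpose compressor on random bytes versus
    # highly repetitive bytes of roughly the same size.
    random_data = os.urandom(100000)
    structured_data = b"spam and eggs " * 7143   # ~100 KB of repetition

    print(len(zlib.compress(random_data, 9)))      # a little *larger* than 100000
    print(len(zlib.compress(structured_data, 9)))  # a few hundred bytes

The random input comes out slightly bigger than it went in; the
structured input collapses to almost nothing.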

> And most importantly:
>
> - Random data is a superset of the arbitrary structured data you mention
>   below. If we could compress random data, then we could compress any data
>   at all, no matter how much or little structure it contained.

Yes, that's part of my point.  Arbitrary data includes random data, but
the term avoids arguments about what "random" means.
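
The counting argument is the cleanest way to see why: a lossless
compressor must be injective, and there simply are not enough shorter
strings to go round.  A toy calculation (no actual compressor involved):

    # There are 2**n bit strings of length n, but only 2**n - 1 strings
    # of any strictly shorter length (lengths 0 through n-1 combined),
    # so no lossless scheme can shorten every length-n input.
    n = 16
    inputs = 2 ** n
    shorter_outputs = sum(2 ** k for k in range(n))  # equals 2**n - 1
    print(inputs, shorter_outputs)  # 65536 65535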

> This is why the ability to compress random data (if it were possible, which it
> is not) is interesting. It's not because people want to be able to compress
> last night's lottery numbers, or tables of random digits.

The trouble is a pedagogic one.  Saying "you can't compress random data"
inevitably leads (though, again, this is just my experience) to endless
attempts to define random data.  My preferred way out of that is to talk
about algorithmic complexity, but for your average "I've got a perfect
compression algorithm" poster, that is a step too far.

I think "arbitrary data" (thereby including the results of compression
by said algorithm) is the best way to make progress.
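
One way to make that concrete is to run a compressor on its own output
(again zlib, purely as an example; any decent compressor behaves the
same way):

    import zlib

    # Compress some very redundant data, then feed the output back in.
    data = b"the quick brown fox jumps over the lazy dog\n" * 2000
    for i in range(4):
        out = zlib.compress(data, 9)
        print("pass %d: %d -> %d bytes" % (i + 1, len(data), len(out)))
        data = out

The first pass shrinks the input enormously; after that there is almost
no structure left to exploit, so later passes stop shrinking and
typically grow by a few header bytes.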

<snip>
>> Then you publish in a major journal.  Post the link to the journal
>> article when you are done.
>
> These days there are plenty of predatory journals which will be happy to take
> Dancerswithnumber's money in return for publishing it in a junk
> journal.

Sure, but you usually get a huge advantage -- a text to criticise.  Your
average Usenet crank will keep changing what they say to avoid being
pinned down.  Plus you get to note the fact that the journal is junk.

-- 
Ben.


