Compression of random binary data

Gregory Ewing greg.ewing at canterbury.ac.nz
Mon Oct 23 01:08:36 EDT 2017


danceswithnumbers at gmail.com wrote:
> On Monday, July 11, 2016 at 11:52:27 AM UTC-6, jonas.t... at gmail.com wrote:

>> What is to say that you can not do it if the symbolic representation is
>> richer than the symbolic represenatation of the dataset.

The more symbols you have in your alphabet, the more bits
are needed to encode each symbol.

>> Isn't it a fact that the set of squareroots actually depict numbers in a
>> shorter way than their actual representation.

No.

> In Short, I cannot find a single mathematical proof that says you cannot
> compress random numbers. Pigeon hole and other conjectures are just that.

No, they are not conjectures, they are proofs. If you don't
recognise them as such, then you haven't fully understood them.

> In
> fact, the biggest fallacy when people start talking about compression is to
> say that all compression alg rely on redundancies, or repetitive sequences.

I think it's more correct to say they rely on the fact that
the input data is encoded *ineffiently* in some way -- it uses
more bits than are necessary to represent the information.

> The million random digits is compressed down to 415,241 kb. I have been only
> able to compress that down to 352,954 kb. A very small amount. It has taken
> six years to come up with that small amount.

You may well have an algorithm that encodes *some* million
digit sequences in less than 415,241 bytes. But I can guarantee
that there will be other million-digit sequences that it
encodes to *more* than 415,241 bytes. The average over all
possible million-digit sequences, assuming they are all
equally likely, *cannot* be less than 415,241 bytes. Anything
else would be a logical impossibility.

Note that "equally likely" is important here. That's what
"random data" means in this context -- that all possible
input sequences have the same probability.

This provides another way of thinking about how compression
works: it relies on all input sequences *not* having the
same probability. If some are more likely than others, then
you can encode the more frequent ones with less bits and
the less frequent ones with more bits, and gain overall.

-- 
Greg




More information about the Python-list mailing list