Compression of random binary data

Richard Damon Richard at Damon-Family.org
Tue Oct 24 19:15:28 EDT 2017


On 10/24/17 6:30 PM, Steve D'Aprano wrote:
> On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:
> 
>> On 2017-10-23 04:21, Steve D'Aprano <steve+python at pearwood.info> wrote:
>>> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>>>
>>> If the probability of certain codes (either single codes, or sequences of
>>> codes) are non-equal, then you can take advantage of that by encoding the
>>> common cases into a short representation, and the uncommon and rare cases
>>> into a longer representation. As you say:
>>>
>>>
>>>>    Otherwise, if ( 0, 0 ) is much more frequent,
>>>>    we can encode ( 0, 0 ) by "0" and
>>>>
>>>> ( 0, 1 ) by "101",
>>>> ( 1, 0 ) by "110", and
>>>> ( 1, 1 ) by "111".
>>>>
>>>>    And we could then use /less/ than two bits on the
>>>>    average.
>>>
>>> That's incorrect. On average you use 2.5 bits.
>>>
>>> (1*1 bit + 3*3 bits, divided by four possible outcomes, makes 2.5 bits.)
>>
>> I disagree. If the distribution is not equal, then the average needs to
>> take the different probabilities into account.
> 
> I think I would call that the *weighted* average rather than the average.
> 
> Regardless of what we call it, of course both you and Stefan are right in how
> to calculate it, and such a variable-length scheme can be used to compress
> the data.
> 
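
To make the two averages concrete, here is a quick sketch (the
70/10/10/10 split below is just an assumed example distribution, not
anything given in the thread):

# Stefan's variable-length code for bit pairs.
code = {(0, 0): "0", (0, 1): "101", (1, 0): "110", (1, 1): "111"}

# Unweighted average, every pair counted once: Steve's 2.5 bits.
print(sum(len(c) for c in code.values()) / len(code))    # 2.5

# Weighted average under an assumed skewed distribution in which
# (0, 0) turns up 70% of the time (illustrative numbers only).
probs = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}
print(sum(probs[p] * len(code[p]) for p in code))         # ~1.6, i.e. < 2
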
> E.g. given the sequence 00000011 which would take 8 bits in the obvious
> encoding, we can encode it as "000111" which takes only 6 bits.
> 
> But the cost of this encoding scheme is that *some* bit sequences expand, e.g.
> the 8 bit sequence 11111100 is encoded as "1111111110" which requires 10
> bits.
> 
> The end result is that averaged over all possible bit sequences (of a certain
> size), this encoding scheme requires MORE space than the obvious 0/1 bits.
> 
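
A minimal sketch of that pair encoder (throwaway illustration code,
not anything actually posted in the thread) reproduces both examples
above and, brute-forced over all 256 possible 8-bit inputs, shows the
output averaging 10 bits against the original 8:

from itertools import product

code = {(0, 0): "0", (0, 1): "101", (1, 0): "110", (1, 1): "111"}

def encode(bits):
    """Encode a string of '0'/'1' characters two bits at a time."""
    pairs = zip(bits[0::2], bits[1::2])
    return "".join(code[(int(a), int(b))] for a, b in pairs)

print(encode("00000011"))      # 000111      -> 6 bits, shorter
print(encode("11111100"))      # 1111111110  -> 10 bits, longer

# Averaged over every possible 8-bit input the scheme loses ground:
total = sum(len(encode("".join(bits))) for bits in product("01", repeat=8))
print(total / 256)             # 10.0 bits on average, versus 8
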
> But in practice we don't care much, because the data sequences we care about
> are usually not "all possible bit sequences", but a heavily restricted subset
> where there are lots of 00 pairs and fewer 01, 10, and 11 pairs.
> 
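
As a rough illustration of that last point, the same toy encoder does
come out ahead on a heavily zero-biased stream while expanding a
uniform random one (the 90%-zeros source below is purely an assumed
stand-in for such restricted data):

import random

code = {(0, 0): "0", (0, 1): "101", (1, 0): "110", (1, 1): "111"}

def encode(bits):
    # Encode a string of '0'/'1' characters two bits at a time.
    pairs = zip(bits[0::2], bits[1::2])
    return "".join(code[(int(a), int(b))] for a, b in pairs)

random.seed(0)
n = 100000   # number of input bits, kept even for the pair encoder

# Biased source: each bit is 0 with probability 0.9, so "00" pairs
# dominate.  A uniform source for comparison.
biased = "".join("0" if random.random() < 0.9 else "1" for _ in range(n))
uniform = "".join(random.choice("01") for _ in range(n))

print(len(encode(biased)) / n)    # about 0.7 output bits per input bit
print(len(encode(uniform)) / n)   # about 1.25 output bits per input bit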

My understanding of the 'Random Data Compressibility' challenge is 
that it requires the compressor to accept ANY/ALL strings of up to N 
bits and to generate an output stream no longer than the input 
stream, and sometimes shorter. It admits that, given non-uniformly 
distributed data, it is possible to compress some patterns (the 
common ones) and expand others (the uncommon ones) so as to lower the 
net average length. What it says can't be done is to have a 
compression method that compresses EVERY input pattern. That is where 
the 'pigeonhole' principle comes into play, which the people who 
claim to be able to compress random data like to ignore, or simply 
assert doesn't apply.
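
The counting behind the pigeonhole argument is short enough to sketch
directly (a generic illustration, not tied to any particular wording
of the challenge): there are more strings of exactly n bits than
there are strings shorter than n bits, so no lossless scheme can map
every n-bit input to a strictly shorter output.

# Pigeonhole counting: 2**n distinct n-bit inputs, but only 2**n - 1
# bit strings of length strictly less than n (counting the empty
# string), so at least two inputs would have to share an output.
for n in (4, 8, 16):
    inputs = 2 ** n                           # strings of exactly n bits
    shorter = sum(2 ** k for k in range(n))   # strings of length 0 .. n-1
    print(n, inputs, shorter, inputs > shorter)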



