Compression of random binary data

Steve D'Aprano steve+python at pearwood.info
Tue Oct 24 18:30:57 EDT 2017


On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:

> On 2017-10-23 04:21, Steve D'Aprano <steve+python at pearwood.info> wrote:
>> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>>
>> If the probability of certain codes (either single codes, or sequences of
>> codes) are non-equal, then you can take advantage of that by encoding the
>> common cases into a short representation, and the uncommon and rare cases
>> into a longer representation. As you say:
>>
>>
>>>   Otherwise, if ( 0, 0 ) is much more frequent,
>>>   we can encode ( 0, 0 ) by "0" and
>>> 
>>> ( 0, 1 ) by "101",
>>> ( 1, 0 ) by "110", and
>>> ( 1, 1 ) by "111".
>>> 
>>>   And we could then use /less/ than two bits on the
>>>   average.
>>
>> That's incorrect. On average you use 2.5 bits.
>>
>> (1*1 bit + 3*3 bits divide by four possible outcomes, makes 2.5 bits.)
> 
> I disagree. If the distribution is not equal, then the average needs to
> take the different probabilities into account.

I think I would call that the *weighted* average rather than the average.

Regardless of what we call it, of course both you and Stefan are right in how
to calculate it, and such a variable-length scheme can be used to compress
the data.

E.g. given the sequence 00000011 which would take 8 bits in the obvious
encoding, we can encode it as "000111" which takes only 6 bits.

But the cost of this encoding scheme is that *some* bit sequences expand, e.g.
the 8 bit sequence 11111100 is encoded as "1111111110" which requires 10
bits.

The end result is that averaged over all possible bit sequences (of a certain
size), this encoding scheme requires MORE space than the obvious 0/1 bits.

But in practice we don't care much, because the data sequences we care about
are usually not "all possible bit sequences", but a heavily restricted subset
where there are lots of 00 pairs and fewer 01, 10, and 11 pairs.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.




More information about the Python-list mailing list