Compression of random binary data

Tue Oct 24 16:09:56 EDT 2017

On 2017-10-23 04:21, Steve D'Aprano <steve+python at pearwood.info> wrote:
> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>
> If the probability of certain codes (either single codes, or sequences of
> codes) are non-equal, then you can take advantage of that by encoding the
> common cases into a short representation, and the uncommon and rare cases
> into a longer representation. As you say:
>
>
>>   Otherwise, if ( 0, 0 ) is much more frequent,
>>   we can encode ( 0, 0 ) by "0" and
>> 
>> ( 0, 1 ) by "101",
>> ( 1, 0 ) by "110", and
>> ( 1, 1 ) by "111".
>> 
>>   And we could then use /less/ than two bits on the
>>   average. 
>
> That's incorrect. On average you use 2.5 bits.
>
> (1*1 bit + 3*3 bits divide by four possible outcomes, makes 2.5 bits.)

I disagree. If the distribution is not equal, then the average needs to
take the different probabilities into account. 

Let's assume that (0, 0) has a probability of 90 %, (0, 1) a probability
of 10 % and (1, 0) and (1, 1) a probability of 5 % each. 

Then the average length is 

    0.9 * 1 bit + 0.1 * 3 bits + 0.05 * 3 bits + 0.05 * 3 bits = 1.5 bits.

        hp

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) |                    | Man feilt solange an seinen Text um, bis
| |   | hjp at hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel