Compression of random binary data

Peter J. Holzer hjp-usenet3 at hjp.at
Thu Oct 26 04:22:31 EDT 2017


On 2017-10-24 22:30, Steve D'Aprano <steve+python at pearwood.info> wrote:
> On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:
>
>> On 2017-10-23 04:21, Steve D'Aprano <steve+python at pearwood.info> wrote:
>>> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>>>
>>> If the probabilities of certain codes (either single codes, or sequences of
>>> codes) are non-equal, then you can take advantage of that by encoding the
>>> common cases into a short representation, and the uncommon and rare cases
>>> into a longer representation. As you say:
>>>
>>>
>>>>   Otherwise, if ( 0, 0 ) is much more frequent,
>>>>   we can encode ( 0, 0 ) by "0" and
>>>> 
>>>> ( 0, 1 ) by "101",
>>>> ( 1, 0 ) by "110", and
>>>> ( 1, 1 ) by "111".
>>>> 
>>>>   And we could then use /less/ than two bits on the
>>>>   average.
>>>
>>> That's incorrect. On average you use 2.5 bits.
>>>
>>> (1*1 bit + 3*3 bits, divided by four possible outcomes, makes 2.5 bits.)
>> 
>> I disagree. If the distribution is not equal, then the average needs to
>> take the different probabilities into account.
>
> I think I would call that the *weighted* average rather than the average.

If there are 4 guys who are 180 cm tall and one who is 2 metres tall,
would you say their average height is 184 cm ((180 * 4 + 200 * 1)/(4 + 1))
or 190 cm ((180 + 200)/2)? To me, that's the same situation.
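
To put the same arithmetic in Python (a minimal sketch; the 0.7/0.1
probabilities below are made up purely for illustration, they are not
taken from any real data):

    # Code lengths of the variable-length pair code quoted above.
    code_lengths = {(0, 0): 1, (0, 1): 3, (1, 0): 3, (1, 1): 3}

    def expected_bits(probs):
        # Weighted average: sum of P(pair) * len(code(pair)).
        return sum(probs[pair] * code_lengths[pair] for pair in code_lengths)

    uniform = {pair: 0.25 for pair in code_lengths}
    skewed  = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}

    print(expected_bits(uniform))  # 2.5 -- the unweighted figure
    print(expected_bits(skewed))   # 1.6 -- less than 2 bits per pair

With a uniform distribution you get the 2.5 bits; with a skewed one you
get Stefan's "less than two bits on the average".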


> Regardless of what we call it, of course both you and Stefan are right in how
> to calculate it, and such a variable-length scheme can be used to compress
> the data.
>
> E.g. given the sequence 00000011 which would take 8 bits in the obvious
> encoding, we can encode it as "000111" which takes only 6 bits.
>
> But the cost of this encoding scheme is that *some* bit sequences expand, e.g.
> the 8 bit sequence 11111100 is encoded as "1111111110" which requires 10
> bits.

Right. This is most obvious in Huffman encoding, where each symbol is
replaced by a sequence of bits whose length is tied to the frequency of
that symbol (roughly ld(1/p), the binary log of the inverse probability).
So the letter 'e' might be encoded in 3 or 4 bits (in a corpus of text I
happen to have lying around, about 1 in 11 characters is an 'e', and
ld(11) ≈ 3.46), while the tilde (about 1 character in 21000) would be
encoded in 14 or 15 bits.
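
For the curious, here's a rough sketch of that construction in Python
(standard textbook Huffman built on heapq; the sample string is made up
and is of course not the corpus I mentioned):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # heap entries: (frequency, tie-breaker, {symbol: code-so-far})
        freq = Counter(text)
        heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in c1.items()}        # prefix one with 0,
            merged.update({s: "1" + c for s, c in c2.items()})  # the other with 1
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    codes = huffman_codes("this is just a tiny example, not a real corpus")
    print(codes[" "], codes["t"])   # frequent symbols get short codes
    print(codes["x"], codes[","])   # rare symbols get long codes

The resulting code lengths come out close to ld(1/frequency), which is
where the 3-4 bits for 'e' and 14-15 bits for the tilde come from.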

That's basically how all lossless compression algorithms work: replacing
more frequent bit sequences with shorter ones and less frequent sequences
with longer ones. (Although most operate on variable-length byte
sequences, not individual bytes. I find the LZW algorithm, as used in
Unix compress and GIF images, a particularly instructive example.)
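
If anyone wants to play with it, here is a bare-bones sketch of the LZW
idea in Python (just the dictionary-building core; the real compress/GIF
implementations also pack the codes into a variable number of bits, which
I've left out):

    def lzw_compress(data):
        # Start with one dictionary entry per possible byte value.
        table = {bytes([i]): i for i in range(256)}
        next_code = 256
        current = b""
        out = []
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in table:
                current = candidate            # keep extending the match
            else:
                out.append(table[current])     # emit code for longest known match
                table[candidate] = next_code   # and learn the new, longer sequence
                next_code += 1
                current = bytes([byte])
        if current:
            out.append(table[current])
        return out

    print(lzw_compress(b"abababababab"))   # 12 input bytes -> 6 output codes

The longer byte sequences it learns are exactly the ones that actually
occur in the input, so frequently repeated sequences end up represented
by a single code.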


> The end result is that averaged over all possible bit sequences (of a certain
> size), this encoding scheme requires MORE space than the obvious 0/1 bits.
>
> But in practice we don't care much, because the data sequences we care about
> are usually not "all possible bit sequences", but a heavily restricted subset
> where there are lots of 00 pairs and fewer 01, 10, and 11 pairs.

Right.
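
To make that concrete, here's a quick sketch of the pair code from earlier
in the thread, run over every possible 8-bit input and over a (made-up)
00-heavy sample:

    PAIR_CODE = {"00": "0", "01": "101", "10": "110", "11": "111"}

    def encode(bits):
        return "".join(PAIR_CODE[bits[i:i+2]] for i in range(0, len(bits), 2))

    print(encode("00000011"))   # '000111'     -> 6 bits, shorter
    print(encode("11111100"))   # '1111111110' -> 10 bits, longer

    # Averaged over all 256 possible 8-bit inputs, the scheme loses:
    all_inputs = [format(n, "08b") for n in range(256)]
    print(sum(len(encode(s)) for s in all_inputs) / 256)          # 10.0 > 8

    # ... but on inputs dominated by 00 pairs it wins:
    sample = ["00000000", "00000011", "00110000", "00000000"]
    print(sum(len(encode(s)) for s in sample) / len(sample))      # 5.0 < 8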

        hp


-- 
   _  | Peter J. Holzer    | The curse of electronic word processing:
|_|_) |                    | you keep filing away at your text until
| |   | hjp at hjp.at         | the parts of the sentence no longer
__/   | http://www.hjp.at/ | fit together. -- Ralph Babel


