Algorithm that makes maximum compression of completely diffused data.

Joshua Landau joshua at landau.ws
Sun Nov 3 10:34:07 EST 2013


On 3 November 2013 03:17, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> On Sat, 02 Nov 2013 14:31:09 -0700, Tim Roberts wrote:
>
>> jonas.thornvall at gmail.com wrote:
>>>
>>>Well then I have news for you.
>>
>> Well, then, why don't you share it?
>>
>> Let me try to get you to understand WHY what you say is impossible.
> [snip reasons]
>
> But your second reason, better known as the pigeonhole principle,
> demonstrates that for any lossless compression method, there must be
> data sets that the method actually expands. It doesn't matter how cleverly you
> compress the data, you can't fit 20kg of potatoes in a 10kg bag, so to
> speak. Suppose your compression algorithm compresses a single byte into a
> nybble (four bits). There are 256 different input data sets (0x00,
> 0x01, ... 0xFF) and only 16 different outputs (0x0, 0x1, ... 0xF). There
> is no way for 256 pigeons to fit in 16 pigeon holes unless you put two or
> more pigeons in at least one hole. Ergo, if the compression algorithm is
> lossless, *some* data must be expanded rather than compressed.
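
(To make that counting concrete, here is a throwaway Python check;
squash() below is just a stand-in for *any* byte-to-nybble map, not
anyone's actual algorithm:)

from collections import Counter

def squash(byte):
    # A toy "compressor": every byte down to 4 bits. Any function with
    # this signature has the same problem, whatever it computes.
    return byte & 0xF

counts = Counter(squash(b) for b in range(256))
assert max(counts.values()) > 1   # pigeonhole: 256 inputs, at most 16
                                  # outputs, so some nybble is reused
                                  # and the map cannot be inverted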

You have to be careful to make this totally rigorous, too.

Note that I *can* make a "compression" algorithm that takes any
length-n sequence, shrinks all but one of the 2^n sequences by at
least one bit, and never expands the data. For n = 2:

"00" -> ""
"01" -> "0"
"10" -> "1"
"11" -> "00"

This, obviously, is just because the length is an extra piece of
data, but sometimes you have to store that anyway ;). So if I have a
list of N length-Y lists containing only 1s and 0s, I can genuinely
compress the whole structure by about N log2 Y bits.
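
As a quick sanity check of the sketch above (not a benchmark, just
round-tripping a pile of random bit-lists to confirm the scheme is
lossless and that no individual list ever grows):

import random

random.seed(42)                            # reproducible toy data
N, Y = 1000, 32
rows = [''.join(random.choice('01') for _ in range(Y)) for _ in range(N)]
for bits in rows:
    code = compress(bits)
    assert decompress(code, Y) == bits     # lossless round trip
    assert len(code) <= Y                  # no list ever gets longer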


