Algorithm that makes maximum compression of completely diffused data.

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Nov 2 23:17:57 EDT 2013


On Sat, 02 Nov 2013 14:31:09 -0700, Tim Roberts wrote:

> jonas.thornvall at gmail.com wrote:
>>
>>Well then i have news for you.
> 
> Well, then, why don't you share it?
> 
> Let me try to get you to understand WHY what you say is impossible.
[snip reasons]

Expanding on Tim's post... the first scenario Tim gives, applying the 
compression repeatedly to its own output until you eventually get a 
single byte, can be overcome if there are data sets that are unchanged by 
compression. That is, if f() is the compression function:

f(f(f(data))) == f(f(data)) == f(data) == data

for some values of data. So if you start with some other data, and 
compress it, then compress it again, and again, eventually you end up 
with one of these attractors, after which repeated compression doesn't 
give you any more benefit.

[Aside: the attractors aren't necessarily a single point. The attractor 
could be a cycle of two or more points, f(x) == y and f(y) == x. Or you 
could even have a "strange attractor", which brings us to chaos theory.] 
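
If you want to see the "no more benefit" part with a real library 
rather than a hypothetical f(), here is a quick throwaway experiment 
(plain zlib, not anybody's magic algorithm): after the first pass, 
feeding the compressor its own output stops shrinking it and in fact 
grows it slightly.

import zlib

data = b"spam " * 20000           # highly compressible to start with
for i in range(5):
    print("pass", i, ":", len(data), "bytes")
    data = zlib.compress(data)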

But your second reason, better known as the pigeonhole principle, 
demonstrates that for any lossless compression method, there must be 
data sets that it actually expands. It doesn't matter how cleverly you 
compress the data: you can't fit 20kg of potatoes in a 10kg bag, so to 
speak. Suppose your compression algorithm compresses a single byte into a 
nybble (four bits). There are 256 different input data sets (0x00, 
0x01, ... 0xFF) and only 16 different outputs (0x0, 0x1, ... 0xF). There 
is no way for 256 pigeons to fit in 16 pigeon holes unless you put two or 
more pigeons in at least one hole. Ergo, if the compression algorithm is 
lossless, *some* data must be expanded rather than compressed.
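
To make the counting concrete, here is a tiny Python sketch (the 
particular "compressor" below, which just keeps the high four bits, is 
made up; the collision is forced no matter which function you pick):

from collections import defaultdict

def toy_compress(byte):
    return byte >> 4              # 256 inputs, only 16 possible outputs

holes = defaultdict(list)
for pigeon in range(256):
    holes[toy_compress(pigeon)].append(pigeon)

crowded = [h for h, pigeons in holes.items() if len(pigeons) > 1]
print(len(holes), "holes,", len(crowded), "of them hold 2+ pigeons")
assert crowded                    # guaranteed by the pigeonhole principle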

Alternatively, the compression may be lossy. Or both!

Obviously data isn't compressed a byte at a time; that would be silly. 
But the principle still applies to inputs of any length: there are 2**n 
possible n-bit messages and only 2**n - 1 messages strictly shorter 
than n bits, so no lossless scheme can shrink them all.
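
In Python terms (a throwaway check, not part of any real compressor):

n = 20
inputs = 2 ** n
shorter_outputs = sum(2 ** k for k in range(n))   # lengths 0 .. n-1
print(inputs, "inputs vs", shorter_outputs, "strictly shorter outputs")
assert shorter_outputs == inputs - 1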

There is a way to apparently get around these limits: store data 
externally, perhaps inside the compression application itself. Then, if 
you just look at the compressed file (the "data.zip" equivalent, although 
I stress that zip compression is *not* like this), you might think it has 
shrunk quite a lot. But when you include the hidden data, the compression 
is not quite so impressive...
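
A sketch of the cheat, with everything below invented for illustration: 
the bulky payload lives inside the "decompressor" itself, so the 
compressed file is a single byte and the ratio looks miraculous only 
because the hidden table isn't being counted.

HIDDEN_TABLE = {0: b"pretend this is a huge data set baked into the tool"}

def cheat_compress(data):
    for key, stored in HIDDEN_TABLE.items():
        if stored == data:
            return bytes([key])   # one byte of "compressed" output
    raise ValueError("only works on data smuggled into the table")

def cheat_decompress(blob):
    return HIDDEN_TABLE[blob[0]]

original = HIDDEN_TABLE[0]
tiny = cheat_compress(original)
print(len(original), "bytes ->", len(tiny), "byte")
assert cheat_decompress(tiny) == original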


-- 
Steven


