Algorithm that makes maximum compression of completely diffused data.

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Nov 7 23:47:46 EST 2013


On Thu, 07 Nov 2013 18:43:17 -0800, Mark Janssen wrote:

>>> I am not sure if it is just stupidity or laziness that prevents you
>>> from seeing that 4^8=65536.
>>
>> I can see that 4^8 = 65536. Now how are you going to render 65537? You
>> claimed that you could render *any* number efficiently. What you've
>> proven is that a small subset of numbers can be rendered efficiently.
> 
> I think the idea would be to find the prime factorization for a given
> number, which has been proven to be available (and unique) for any and
> every number.  Most numbers can be compressed given this technique.

Sadly, they would not.

Let's suppose you wish to compress the number 143. So you find the prime 
factorization, 11*13=143. Have you compressed anything? No you have not 
-- the original number, 143, fits in a single byte, while it requires 
*two* bytes to deal with 11 and 13 (one byte for 11, and a second byte 
for 13).
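
A quick sanity check in Python (just a sketch) makes the arithmetic
concrete:

n = 143
factors = [11, 13]
print(n.bit_length())  # 8 -> fits in one byte
print(sum((f.bit_length() + 7) // 8 for f in factors))  # 2 bytes total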

We can try a different strategy: while 143 requires a full eight bits 
(one byte), both 11 and 13 require only four bits (one nybble) each. So 
by hard coding our expectation that the result will be two nybbles, we 
can avoid expanding the data and end up with exactly zero compression:

143 = binary 10001110 or eight bits in total
11, 13 = binary 1011 1101 or eight bits in total

But while that trick works for 143, it doesn't work for 154 (one byte, 
binary 10011010) which has prime factorization 2*7*11 (three nybbles 0010 
0111 1011). Even if you can somehow drop the leading zeroes, it requires 
nine bits versus eight.
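
You can verify the bit counts directly; here's a minimal sketch that
compares the number itself against its factors packed into whole
nybbles, and against the factors with leading zeroes dropped:

def nybble_bits(factors):
    return 4 * len(factors)  # one nybble per factor (all < 16 here)

def minimal_bits(factors):
    return sum(f.bit_length() for f in factors)

print((143).bit_length(), nybble_bits([11, 13]), minimal_bits([11, 13]))
# 8 8 8   -> break even at best
print((154).bit_length(), nybble_bits([2, 7, 11]), minimal_bits([2, 7, 11]))
# 8 12 9  -> expansion either way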

Suppose instead of using bytes, you use decimal digits. 11 and 13 use 
four digits, while 143 uses only three. Likewise, 154 has three decimal 
digits while 2*7*11 has four digits. In both cases, there is no 
compression.
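
Checking the digit counts is a one-liner sketch:

for n, factors in [(143, [11, 13]), (154, [2, 7, 11])]:
    print(len(str(n)), sum(len(str(f)) for f in factors))
# 3 4
# 3 4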

How about recording which prime number it is, instead of the actual value 
of the prime? 154=2*7*11 uses the 1st, 4th and 5th primes, so maybe we 
can record 1 4 5 instead of 2 7 11 and save some bits. (We'll 
save more bits the larger the prime.) Of course, to reverse the 
compression you need to keep a table of the prime numbers somewhere, and 
that's going to be a lot bigger than the space you save...
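
Here is a sketch of that index scheme (the sieve and the encoding are
purely illustrative, not part of any real compressor):

def primes_up_to(limit):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return [i for i, is_p in enumerate(sieve) if is_p]

primes = primes_up_to(100)
index = {p: i + 1 for i, p in enumerate(primes)}  # 1-based: 2 is the 1st
print([index[f] for f in (2, 7, 11)])             # [1, 4, 5] for 154

The indices are smaller than the primes, but decoding them back to
2, 7, 11 needs the prime table (or the code to regenerate it), and that
table dwarfs any bits you save.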

Now, I accept that, possibly, with sufficient work we might be able to 
find some cunning mechanism by which we can compress *any* integer value. 
But it won't be the same mechanism for every value! For example, we might 
compress (2**1000+1) using a run-length encoding of the binary bits, 
while compressing 14629839399572435720381495231 as its prime 
factorization (the 319th prime number, the 479th prime, the 499th prime 
six times), and 10**1000 as an encoding of the decimal digits. But we 
still need some sort of tag to distinguish between these different 
compression schemes, and once you take into account the extra information 
in the tag, there will be cases where some numbers aren't compressed at 
all.
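
The tag problem is just the counting argument in disguise: there are 2**n 
bit strings of length n, but only 2**n - 1 bit strings shorter than n, so 
no lossless scheme can shrink them all. Two lines make the pigeonhole 
explicit:

n = 8
shorter = sum(2 ** k for k in range(n))  # all bit strings of length 0..n-1
print(2 ** n, shorter)                   # 256 inputs, only 255 shorter outputs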

The ability to losslessly compress *everything* is sheer pseudo-
mathematics, up there with finding an exact rational value for pi, or 
squaring the circle using only a compass and straight-edge. But the 
ability to losslessly compress *useful data* is real, and such algorithms 
operate by finding non-random patterns and redundant information in the 
data.
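
You can watch a real compressor behave exactly this way: zlib shrinks
redundant input and slightly *expands* input with no pattern to find
(exact output sizes vary a little by zlib version, but the direction
won't):

import os
import zlib

redundant = b"spam" * 256      # 1024 bytes of obvious repetition
random_ish = os.urandom(1024)  # 1024 bytes with no structure

print(len(zlib.compress(redundant)))   # far fewer than 1024 bytes
print(len(zlib.compress(random_ish)))  # a little *more* than 1024 bytes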



-- 
Steven


