Compression of random binary data

jonas.thornvall at gmail.com
Wed Jul 13 06:04:43 EDT 2016


On Wednesday, 13 July 2016 at 04:29:48 UTC+2, Steven D'Aprano wrote:
> On Wed, 13 Jul 2016 03:35 am, jonas.thornvall at gmail.com wrote:
> 
> > No it is only compressible down to a limit given by the algorithm.
> 
> Right! Then there is data that you can't compress.
> 
> Suppose you have some data:
> 
> data = "ABABABABABAB...ABAB"
> 
> And you compress it "down to a limit":
> 
> x = compress(compress(compress(data)))
> print(x)
> => prints "@nx7%k!b"
> 
> Now let's try again with something else:
> 
> data = "AABBBCCCCDDDDEEEE...ZZZZ"
> 
> And you compress it "down to a limit":
> 
> x = compress(compress(compress(compress(data))))
> print(x)
> => prints "wu*$cS#k-pv32zx[&+r"
> 
> 
> One more time:
> 
> data = "AABBAABBAABBAABBAABB"
> x = compress(data)
> print(x)
> => prints "g^x3@"
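
(As an aside, here is roughly what the pseudo-code above looks like with
Python's real zlib module -- just an illustrative sketch, not part of the
original post. A highly repetitive string collapses to a tiny fraction of
its size:)

import zlib

data = b"AB" * 1000                  # 2000 bytes of a repeating pattern
packed = zlib.compress(data, 9)      # level 9 = maximum compression
print(len(data), len(packed))        # 2000, then only a few dozen bytes
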
> 
> 
> We agree on this. Now you say, "Give me some random data, anything at all,
> and I'll compress it!", and I run a random number generator and out pops:
> 
> data = "@nx7%k!b"
> 
> or possibly:
> 
> data = "wu*$cS#k-pv32zx[&+r"
> 
> or:
> 
> data = "g^x3@"
> 
> 
> and I say "Compress that!"
> 
> But we've already agreed that this is as compressed as you can possibly make
> it. You can't compress it any more.
> 
> So there's *at least some* random data that you can't compress. Surely you
> have to accept that. You don't get to say "Oh, I don't mean *that* data, I
> mean only data that I can compress". Random data means it's random, you
> don't get to pick and choose between data you can compress and data that
> you can't.
> 
> Now the tricky part is to realise that it's not just short sequences of
> random data that can't be compressed. The same applies to LONG sequences
> too. If I give you a gigabyte of raw video, you can probably compress that a
> fair bit. That's what things like x264 encoders do. The x265 encoder is
> even better. But they're lossy, so you can't reverse them.
> 
> But if I give you a gigabyte of random data, you'll be lucky to find *any*
> patterns or redundancies that allow compression. You might be able to
> shrink the file by a few KB. And if you take that already compressed file,
> and try to compress it again, well, you've already hit the limit of
> compression. There's no more redundancy left to remove.
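
(Again only an illustrative sketch with the standard zlib module, not part
of the original post: genuinely random bytes do not shrink, and compressing
the already-compressed output just adds format overhead:)

import os
import zlib

data = os.urandom(10**6)             # one million random bytes
once = zlib.compress(data, 9)
twice = zlib.compress(once, 9)
print(len(data), len(once), len(twice))
# len(once) typically comes out slightly *larger* than len(data), and
# len(twice) larger still -- there is no redundancy left to exploit.
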
> 
> It doesn't matter how clever you are, or what a "folding structure" is, or
> how many times you iterate over the data. It's a matter of absolute
> simplicity: the pigeonhole principle. You can't argue with the numbers.
> 
> If you start with a 100 digit decimal number, there are 10**100 different
> pigeons. If you can compress down to a 6 digit decimal number, there are
> 10**6 pigeon holes. You cannot put 10**100 pigeons into 10**6 pigeon holes
> without doubling up (which makes your compression lossy).
> 
> So either some numbers cannot be compressed, or some numbers are compressed
> to the same result, and you can't tell which was the original. That's your
> choice: a lossless encoder means some numbers can't be compressed, a lossy
> encoder means you can't reverse the process exactly.
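
(A tiny brute-force illustration of the pigeonhole counting, added here
only as a sketch for reference: map every 3-bit message onto a 2-bit
message and collisions are unavoidable, no matter which mapping you pick:)

from itertools import product

pigeons = ["".join(p) for p in product("01", repeat=3)]  # 8 possible inputs
holes = ["".join(h) for h in product("01", repeat=2)]    # only 4 possible outputs

def compressor(msg):
    return msg[:2]    # an arbitrary example mapping; any other fares no better

outputs = [compressor(msg) for msg in pigeons]
print(len(pigeons), len(set(outputs)))   # 8 inputs, at most 4 distinct outputs
# At least two inputs must share an output, so the mapping cannot be
# reversed losslessly -- the same counting applies to 10**100 vs 10**6.
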
> 
> 
> 
> 
> 
> -- 
> Steven
> “Cheer up,” they said, “things could be worse.” So I cheered up, and sure
> enough, things got worse.

Ok, try to see it this way: ****very big**** numbers can be described as the sum or difference of a sequence of a few polynomials. Unfortunately, we lack the computational skill/computing power to find them.

That is not the case using foldings/geometric series.
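
(To make the polynomial/geometric-series idea concrete -- a rough sketch,
and only an illustration of what such a short description could look like,
not a claim that one exists for every number: a highly structured 100-digit
number can be written as a geometric series and expanded back exactly.)

# the 100-digit repunit 111...1 is the geometric series
# 10**0 + 10**1 + ... + 10**99, which sums to (10**100 - 1) // 9 --
# a description far shorter than its 100 digits
short_description = (10**100 - 1) // 9
expanded = sum(10**k for k in range(100))
print(short_description == expanded)     # True
print(len(str(expanded)))                # 100

Whether comparably short descriptions exist for numbers with no such
structure is exactly what the pigeonhole argument quoted above is about.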


