Compression of random binary data

jonas.thornvall at gmail.com jonas.thornvall at gmail.com
Wed Jul 13 06:14:36 EDT 2016


Den onsdag 13 juli 2016 kl. 12:05:03 UTC+2 skrev jonas.t... at gmail.com:
> Den onsdag 13 juli 2016 kl. 04:29:48 UTC+2 skrev Steven D'Aprano:
> > On Wed, 13 Jul 2016 03:35 am, jonas.thornvall at gmail.com wrote:
> > 
> > > No it is only compressible down to a limit given by the algorithm.
> > 
> > Right! Then there is data that you can't compress.
> > 
> > Suppose you have some data:
> > 
> > data = "ABABABABABAB...ABAB"
> > 
> > And you compress it "down to a limit":
> > 
> > x = compress(compress(compress(data)))
> > print(x)
> > => prints "@nx7%k!b"
> > 
> > Now let's try again with something else:
> > 
> > data = "AABBBCCCCDDDDEEEE...ZZZZ"
> > 
> > And you compress it "down to a limit":
> > 
> > x = compress(compress(compress(compress(data))))
> > print(x)
> > => prints "wu*$cS#k-pv32zx[&+r"
> > 
> > 
> > One more time:
> > 
> > data = "AABBAABBAABBAABBAABB"
> > x = compress(data)
> > print(x)
> > => prints "g^x3@"
> > 
> > 
> > We agree on this. Now you say, "Give me some random data, anything at all,
> > and I'll compress it!", and I run a random number generator and out pops:
> > 
> > data = "@nx7%k!b"
> > 
> > or possibly:
> > 
> > data = "wu*$cS#k-pv32zx[&+r"
> > 
> > or:
> > 
> > data = "g^x3@"
> > 
> > 
> > and I say "Compress that!"
> > 
> > But we've already agreed that this is as compressed as you can possibly make
> > it. You can't compress it any more.
> > 
> > So there's *at least some* random data that you can't compress. Surely you
> > have to accept that. You don't get to say "Oh, I don't mean *that* data, I
> > mean only data that I can compress". Random data means its random, you
> > don't get to pick and choose between data you can compress and data that
> > you can't.
> > 
> > Now the tricky part is to realise that its not just short sequences of
> > random data that can't be compressed. The same applies for LONG sequences
> > to. If I give you a gigabyte of raw video, you can probably compress that a
> > fair bit. That's what things like x264 encoders do. The x265 encoder is
> > even better. But they're lossy, so you can't reverse them.
> > 
> > But if I give you a gigabyte of random data, you'll be lucky to find *any*
> > patterns or redundancies that allow compression. You might be able to
> > shrink the file by a few KB. And if you take that already compressed file,
> > and try to compress it again, well, you've already hit the limit of
> > compression. There no more redundancy left to remove.
> > 
> > It doesn't matter how clever you are, or what a "folding structure" is, or
> > how many times you iterate over the data. It's a matter of absolute
> > simplicity: the pigeonhole principle. You can't argue with the numbers.
> > 
> > If you start with a 100 digit decimal number, there are 10**100 different
> > pigeons. If you can compress down to a 6 digit decimal number, there are
> > 10**6 pigeon holes. You cannot put 10*100 pigeons into 10**6 pigeon holes
> > without doubling up (which makes your compression lossly).
> > 
> > So either some numbers cannot be compressed, or some numbers are compressed
> > to the same result, and you can't tell which was the original. That's your
> > choice: a lossless encoder means some numbers can't be compressed, a lossy
> > encoder means you can't reverse the process exactly.
> > 
> > 
> > 
> > 
> > 
> > -- 
> > Steven
> > “Cheer up,” they said, “things could be worse.” So I cheered up, and sure
> > enough, things got worse.
> 
> Ok, try to see it this way ****very big**** numbers can be described as the sum or difference between a sequense of a few polynomials. Unfortunately we lack the computational skill/computing power to find them.
> 
> That is not the case using foldings/geometric series.

The sum or difference of a few small polynomials would still of course be a polynomial. But as i say it is a case of enrich the interpretation of the symbolic set that you look at you replace digits bits/integers with arithmetic describing them.



More information about the Python-list mailing list