Compression of random binary data

jonas.thornvall at gmail.com jonas.thornvall at gmail.com
Tue Jul 12 10:24:06 EDT 2016


On Monday, 11 July 2016 at 20:38:51 UTC+2, Steven D'Aprano wrote:
> On Tue, 12 Jul 2016 03:52 am, jonas.thornvall at gmail.com wrote:
> 
> > What statistical law, mathematical conjecture, or even physical law is
> > violated by compression of random binary data?
> 
> The pigeon hole principle. If you have 100 pigeon holes, and 101 pigeons,
> then clearly at least one pigeon hole must have two pigeons in it.
> 
> To keep the numbers small and manageable, let's say we are going to compress
> one byte at a time. Now a byte has eight bits, so there are exactly 256
> possible bytes:
> 
> 0000 0000
> 0000 0001
> 0000 0010
> ...
> 1111 1110
> 1111 1111
> 
> Now, suppose I claim that I can LOSSLESSLY (that is, reversibly) compress
> any random byte to just two bits. The lossless part is important: it's not
> hard to compress random data by irreversibly throwing some of it away, and
> there's no violation there.
> 
> So I claim that you can give me any random byte, I will compress it to just
> two bits:
> 
> 00
> 01
> 10
> 11
> 
> and then be able to decompress it back again to give you the original byte
> once more.
> 
> Obviously I'm trying to pull a fast one! There's no way I can do this. I can
> squeeze 256 pigeons into just four pigeon holes, but only by doubling them
> up. Suppose I compress these three bytes to 00:
> 
> 0000 0000
> 0110 1001
> 1100 0110
> 
> Now when I go to uncompress 00, what should I return? There is no way for me
> to know which of the three was the original value.
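> 
> For instance, here is a quick Python sketch of the same counting problem
> (the modulo "compressor" below is just a stand-in for any scheme that maps
> eight bits down to two):
> 
>     from collections import defaultdict
> 
>     def compress(byte):          # hypothetical 8-bit -> 2-bit "compressor"
>         return byte % 4          # only four possible outputs: 0, 1, 2, 3
> 
>     buckets = defaultdict(list)
>     for value in range(256):
>         buckets[compress(value)].append(value)
> 
>     for code, originals in buckets.items():
>         print(code, "is shared by", len(originals), "different bytes")
> 
>     # Every 2-bit code ends up shared by 64 different bytes, so there is
>     # no way to "uncompress" a code back to a unique original.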
> 
> (If I'm cunning, I'll have sneakily stored some data *elsewhere*, say, in
> the file name, or in a database, so that you need this extra hidden data to
> uncompress the 00 back to a full byte. But then I'm not compressing eight
> bits down to two. I'm compressing eight bits down to two bits plus
> who-knows-how-many-bits of hidden data.)
> 
> So the pigeon hole principle tells us one of two things:
> 
> (1) If you compress random data, then it must be lossy; I can compress eight
> bits to two, but then I can't uncompress it back again, at least not
> without throwing away some data.
> 
> (2) Or, if the compression is lossless, then some data must be expanded
> rather than compressed. If you pick data at random, some of it will be
> expanded.
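> 
> The counting behind this is easy to check; a minimal sketch that just
> tallies how many shorter codes exist:
> 
>     # How many distinct bit strings are strictly shorter than 8 bits?
>     shorter_codes = sum(2 ** n for n in range(8))   # lengths 0 through 7
>     print(shorter_codes)   # 255, but there are 256 possible bytes
>     # So no lossless scheme can give every byte a shorter code: at least
>     # one byte has to stay the same length or grow.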
> 
> Suppose I have a compression algorithm that infallibly and reversibly
> compresses as follows:
> 
> 0000 0000 <--> 00
> 0000 0001 <--> 01
> 0000 0010 <--> 10
> 0000 0011 <--> 11
> 
> That part is fine. But what will my algorithm do with the other 252 bytes?
> At *best* it will leave them untouched:
> 
> 0000 0100 <--> 0000 0100
> ...
> 1111 1111 <--> 1111 1111
> 
> which is no compression at all, but at worst it will actually expand them
> and make them bigger. (After all, it's likely that my compression format
> has at least a bit of overhead.)
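> 
> One way to see where that overhead comes from: any reversible scheme needs
> some marker so the decompressor knows which case it is looking at. A toy
> version in Python (the 1-bit flag is the illustrative part):
> 
>     def toy_compress(byte):
>         if byte < 4:
>             return "0" + format(byte, "02b")   # flag 0 + 2 bits = 3 bits
>         return "1" + format(byte, "08b")       # flag 1 + 8 bits = 9 bits
> 
>     def toy_decompress(bits):
>         return int(bits[1:], 2)   # the flag tells us how to read the rest
> 
>     sizes = [len(toy_compress(b)) for b in range(256)]
>     print(min(sizes), max(sizes))   # 3 9: four bytes shrink, 252 bytes grow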
> 
> In practice, compression algorithms are designed to look for particular
> kinds of order or structure in the data, and compress that. That's fine for
> the sorts of non-random data we care about: pictures are rarely pictures of
> static, text files are rarely random collections of bits. But if you do
> throw a random set of bits at a lossless compression algorithm, it will at
> best not compress it at all, and at worst actually make the file bigger.
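> 
> You can watch that happen with a real compressor, for example zlib on
> random bytes:
> 
>     import os, zlib
> 
>     data = os.urandom(100_000)        # 100,000 random bytes
>     packed = zlib.compress(data, 9)   # maximum compression effort
>     print(len(data), len(packed))     # the "compressed" copy comes out larger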
> 
> 
> > What is to say that you cannot do it if the symbolic representation is
> > richer than the symbolic representation of the dataset?
> > 
> > Isn't it a fact that the set of square roots actually depicts numbers in
> > a shorter way than their actual representation?
> 
> Sure. But you need to know what √2 means. It *represents* the number
> 1.41421356237... but doesn't compress it. There's nothing you can do to the
> symbol √2 that will uncompress back to the infinite series of digits. All
> you can do is look it up somewhere to see what the digits are.
> 
> > Now the interpreter or program must know the rules. And I have very good
> > rules to make it happen.
> 
> Right. How much information is in the rules? More than you save with
> the "compression". Consider:
> 
> 1.41421356237 compressed down to √2, that's 13 characters down to 2. Great!
> But to *uncompress*, you need to store a rule:
> 
> √2=1.41421356237 
> 
> and that's *sixteen* characters. So your "compression" is:
> 
> original: 13
> compressed to: 2
> plus rule: 16
> 
> means you have "compressed" 13 characters to 18.
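> 
> The same book-keeping in a couple of lines of Python:
> 
>     original   = len("1.41421356237")       # 13 characters
>     compressed = len("√2")                  # 2 characters
>     rule       = len("√2=1.41421356237")    # 16 characters
>     print(compressed + rule)                # 18 -- more than the 13 we started with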
> 
> Now, this is still worth doing if you need to repeat the √2 many times, so
> long as you don't have to repeat the rule. That's useful. But it's not
> compression. It's more like keeping an index to a database, or a scrap of
> paper with the title of a book written on it:
> 
> "See Lord Of The Rings, by J.R.R. Tolkien"
> 
> That's a lot smaller than the actual book: eight words, instead of who knows
> how many tens of thousands. But you can't call it compression: you can't
> sit down with the scrap of paper *and nothing else* and uncompress it back
> to the entire LOTR trilogy.
> 
> 
> 
> 
> -- 
> Steven
> “Cheer up,” they said, “things could be worse.” So I cheered up, and sure
> enough, things got worse.

But it seems your reasoning is based upon interpretation of the actual digit,
bit and byte values. There could be different interpretation worlds; of course
you would have to choose one using digits. An interpretation world here could
be reading out different word lengths from the dataset, maybe together with a
lookup table.


But it could also be arithmetic rules that magically recreate a number from a
number of folds, or from a difference of folds.



