Compression of random binary data

Paul Moore p.f.moore at gmail.com
Mon Oct 23 11:28:59 EDT 2017


On 23 October 2017 at 15:29,  <danceswithnumbers at gmail.com> wrote:
> I'm really not trolling, and even though some are sarcastic I am learning from your comments.

I'm willing to believe that, but if you're trying to claim you have
"compressed" data (in a way that satisfies the technical,
information-theoretic meaning of the word), you need to be *really*
careful not to include hidden information in your assumptions.

For example, if I have a string made up of only the digits 0-7, then
I can trivially store that data (as octal) in 3 bits per digit. But
that's not compression, as there's "hidden information" in the
knowledge that you're using a restricted character set. Factor in that
information and your string only ever contained 3 bits of information
per digit.
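To make that concrete, here's a small sketch (names and the sample
string are mine, just for illustration) of packing octal digits into 3
bits each - the saving comes entirely from the hidden assumption that
only 8 symbols ever occur:

```python
def pack_octal(s):
    """Pack each octal digit (0-7) into 3 bits; return (value, bit count).

    Not compression -- just a tighter encoding that relies on the
    hidden knowledge that the alphabet has only 8 symbols.
    """
    value = 0
    for ch in s:
        d = int(ch)
        assert 0 <= d <= 7, "only octal digits allowed"
        value = (value << 3) | d
    return value, 3 * len(s)

def unpack_octal(value, nbits):
    """Invert pack_octal: read the packed int back 3 bits at a time."""
    digits = []
    for shift in range(nbits - 3, -1, -3):
        digits.append(str((value >> shift) & 0b111))
    return "".join(digits)

packed, nbits = pack_octal("70451")
assert nbits == 15                          # 5 digits x 3 bits
assert unpack_octal(packed, nbits) == "70451"
```

The round trip works for any octal string, but feed it a "9" and the
scheme breaks - which is exactly the hidden assumption at work.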

Using bytes (characters, if you assume a 1-byte encoding) to hold just
the digits 0-9 is inefficient (there are 256 possible byte values and
you're only using 10 of them), and "of course" you can hold that data
more efficiently. But that's not "compression", that's simply using a
better encoding.
In the technical sense, "compression" is about looking at redundancies
that go beyond the case of how effectively you pack data into the
bytes available.
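A quick way to see the gap (the digit string below is made up for
illustration): one byte per digit costs 8 bits each, while a decimal
digit only carries log2(10) ~ 3.32 bits, and re-encoding the whole
string as a single binary integer gets close to that floor - a better
encoding, not compression:

```python
import math

digits = "9081726354"                    # arbitrary 10-digit example
naive_bits = 8 * len(digits)             # one byte per character
binary_bits = int(digits).bit_length()   # string re-read as a base-10 number
floor_bits = math.log2(10) * len(digits) # information-theoretic floor ~33.2

assert naive_bits == 80
assert binary_bits == 34                 # within a bit of the floor
```

(Note the sketch ignores leading zeros - int("0123...") silently drops
them, which is itself a form of hidden information.)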

> Dec to bin is not bad at removing wasted space

Yep, you're not talking about what people usually refer to as
compression, but rather about optimising an encoding.

>, but there is a better way. Here is an example. How would you compress these numbers.

Each of 10 possible digits carries log2(10) ~ 3.32 bits of
information, so getting down to 4 bits per digit is about encoding.
You can go further by using a variable-length encoding and "extra
knowledge" about which digits come up most commonly to give the common
digits shorter representations. That's called Huffman coding. You can
use the fact that repeated copies of the same digit come up together a
lot to replace them by digit + count. That's run-length encoding.
There are other, more complex approaches. But what *all* of
these have in common is that if you provide random input (within the
limits of what you support - digit strings here) then you'll *always*
get at least one input that encodes to a longer output than your
input.
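Run-length encoding shows the expansion effect directly. Here's a
naive sketch (my own toy encoder, deliberately minimal - run lengths
are written as decimal digits, so runs of 10 or more would decode
ambiguously, but it's enough to illustrate the point):

```python
def rle_encode(s):
    """Naive run-length encoding: each run becomes digit + run length."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(s[i] + str(j - i))   # digit followed by run length
        i = j
    return "".join(out)

# Highly repetitive input shrinks...
assert rle_encode("1111111111") == "110"        # 10 chars -> 3
# ...but input with no runs doubles in size:
assert rle_encode("1234567890") == "11213141516171819101"  # 10 -> 20
```

That second assertion is the guarantee mentioned above in action:
random digit strings rarely contain runs, so on random input this
"compressor" usually makes things bigger.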

> Compress this:
>
> 4135124325

10 digits x log2(10) bits per digit = about 33.2 bits of information -
a bit under 34 bits.

>
> Bin to dec...still very large
> 11110110
> 01111000
> 11111101
> 01100101

4x8 = 32 bits, but that's not quite enough: the largest 10-digit
number, 9999999999, needs 34 bits, so you need two more bits (and some
leading zeros for smaller values) to cover all 10-digit numbers.
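The arithmetic is easy to check directly:

```python
import math

# 10 decimal digits carry 10 * log2(10) ~ 33.2 bits of information.
info_bits = 10 * math.log2(10)
assert 33 < info_bits < 34

# The largest 10-digit value needs 34 bits, so a fixed 32-bit (4-byte)
# encoding falls two bits short of covering the whole space...
assert (9999999999).bit_length() == 34

# ...even though this particular example happens to fit in 32 bits.
assert int("4135124325").bit_length() == 32
```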


> New compression method:
>
> 11000101
> 11000111
> 01001111
>
> A full byte less than bin.

You'll need to explain how to decode this, in a way that can be used
to decode the encodings of arbitrary 10-digit numbers, and with any
leading zeroes that are needed to cover the whole space of 10-digit
numbers, before you can claim you've compressed 10-digit numbers using
only 24 bits. And if you do that, you've basically overturned most of
information theory, so I'd start by assuming there's a flaw in your
argument - sorry about that... ;-)
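The flaw has to exist, by a simple pigeonhole count:

```python
# 24 bits can name at most 2**24 distinct outputs, but there are 10**10
# distinct 10-digit strings -- far more inputs than codes, so no decoder
# can recover every input from a 24-bit code.
assert 2 ** 24 == 16777216
assert 10 ** 10 > 2 ** 24            # ~596x more inputs than codes

# 34 bits is the smallest fixed size that does cover the space.
assert 2 ** 33 < 10 ** 10 <= 2 ** 34
```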

Hope this helps put the subject into context. Compression, done right,
is a very technical subject. Special cases can be worked out, sure,
but the "hidden assumptions" in a method are what make the difference
between a "compression algorithm" and a "way of storing my particular
data more efficiently".

> I know many are skeptical....that's okay. This has taken 6 years; I'm not going to insult your intelligence with something so juvenile as dec to bin. I'm really trying to find an application for this since it only deals with digits 0-9 or 0-20 or other strange combinations. Wait, did you just give it away in that small example....no, because it also encrypts in a way that has never been done. The compression is better than log(256)/log(10)....wait, isn't that impossible, binary is minimalistic. I agree that binary is minimalistic, but the architecture is not, making all arguments conjecture...not laws.

No-one is going to accept a claim that an algorithm you're not willing
to publish is valid. This is about maths/science, not "proprietary
algorithms" or anything like that. If you don't publish your methods,
people will simply point at information theoretic proofs and say
"either you're missing something, or your approach doesn't work in
cases that I care about, so thanks but no thanks".

Paul
