Algorithm that makes maximum compression of completely diffused data.

Chris Angelico rosuav at gmail.com
Thu Nov 7 22:24:27 EST 2013


On Fri, Nov 8, 2013 at 1:43 PM, R. Michael Weylandt
<michael.weylandt at gmail.com> wrote:
> Chris's point is more subtle: the typical computer will store the number 65536 in a single byte, but it will also store 4 and 8 in one byte. So if your choice is between sending "65536" and "(4,8)", you actually lose efficiency in your scheme. Don't think in decimal, but in terms of information needing transfer.

Well, 65536 won't fit in a single byte, nor even in two (only just). A
typical binary representation of 65536 would take 3 bytes, or 4 for a
common integer type: 0x00 0x01 0x00 0x00 (big-endian). So it would be
possible to represent that more compactly, if you deem one byte each
for the base and the exponent: 0x04 0x08. However, that system allows
you to represent just 62,041 possible numbers:

>>> decomp = {}
>>> for base in range(256):
...     for exp in range(256):
...         decomp[base**exp] = base, exp
...
>>> len(decomp)
62041

The trouble is, these numbers range all the way up from 0**1 == 0 to
255**255 == uhh...
>>> 255**255
46531388344983681457769984555620005635274427815488751368772861643065273360461098097690597702647394229975161523887729348709679192202790820272357752329882392140552515610822058736740145045150003072264722464746837070302159356661765043244993104360887623976285955058200326531849137668562738184397385361179287309286327712528995820702180594566008294593820621769951491324907014215176509758404760451335847252744697820515292329680698271481385779516652518207263143889034764775414387732372812840456880885163361037485452406176311868267428358492408075197688911053603714883403374930891951109790394269793978310190141201019287109375

which would decompress to (obviously) 255 bytes. So you can represent
just over sixty thousand possible 255-byte strings in two bytes with
this system.
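Both figures are easy to sanity-check: rerun the table from above and measure the byte length of the largest value directly (the byte count here uses `int.bit_length()`, which is just one way to do it):

```python
# Rebuild the (base, exponent) table and check the two claims:
# ~62k distinct representable values, the largest needing 255 bytes.
decomp = {}
for base in range(256):
    for exp in range(256):
        decomp[base ** exp] = (base, exp)

print(len(decomp))                      # 62041 distinct values
largest = max(decomp)                   # 255**255
print((largest.bit_length() + 7) // 8)  # 255 bytes to write it out
```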

To generalize it, you'd need to devote a bit somewhere to saying
"There are more to add in". Let's say you do this on the exponent byte
(high bit for convenience): 0x04 0x08 still means 65536, and
0x04 0x88 0x01 0x01 means 65536 + 1 = 65537. Now we have something that
generalizes; but the efficiency is gone - and there are too many ways
to encode the same value. (Bear in mind, for instance, that 0x01 0xNN
for any NN will still just represent 1, and 0x00 0xNN will represent
0. That's wasting a lot of bits.)
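For concreteness, here is one way the continuation-bit scheme might look in code. The exact layout is my own assumption, not anything specified above: using the high bit of the exponent byte as the "another term follows" flag leaves only 7 bits (0-127) for the exponent itself.

```python
# Hypothetical sketch of the continuation-bit scheme: a value is a sum
# of base**exp terms, each encoded as two bytes.  The exponent byte's
# high bit means "another term follows", so exponents are 7-bit here.

def encode(terms):
    out = bytearray()
    for i, (base, exp) in enumerate(terms):
        more = 0x80 if i < len(terms) - 1 else 0
        out += bytes([base, exp | more])
    return bytes(out)

def decode(data):
    total, i = 0, 0
    while i < len(data):
        base, exp = data[i], data[i + 1]
        total += base ** (exp & 0x7F)   # strip the continuation flag
        if not exp & 0x80:              # high bit clear: last term
            break
        i += 2
    return total

print(encode([(4, 8)]).hex())                    # '0408'  -> 65536
print(encode([(4, 8), (1, 1)]).hex())            # '04880101' -> 65537
print(decode(bytes([0x04, 0x88, 0x01, 0x01])))   # 65537
```

Note how the redundancy shows up immediately: `encode([(4, 8), (1, 1)])`, `encode([(4, 8), (1, 9)])` and many others all decode to 65537.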

The application I can see for this sort of thing is not data
compression, but puzzles. There are lots of puzzles that humans find
enjoyable that represent an entire grid with less data than it
contains - for one popular example, look at Sudoku. You have a 9x9
grid where each cell could contain any of nine digits, giving a
theoretical space of 9**81 possible grids; but the entire grid can be
described by a handful of digits and heaps of blank space. This could
be a similarly-fun mathematical puzzle: 3359232 can be represented as
B1**E1 + B2**E2, where all numbers are single-digit. Find B1, B2, E1,
and E2. In this case, you're guaranteed that the end result is shorter
(four digits), but it's hardly useful for general work.
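(The puzzle surrenders to brute force in a few lines, of course - this is just my quick check, searching all single-digit combinations:)

```python
# Brute-force the puzzle: single-digit B1, E1, B2, E2 with
# B1**E1 + B2**E2 == 3359232.  Sorting each pair of terms
# collapses the (B1,E1)/(B2,E2) swap into one solution.
target = 3359232
solutions = {
    tuple(sorted([(b1, e1), (b2, e2)]))
    for b1 in range(10) for e1 in range(10)
    for b2 in range(10) for e2 in range(10)
    if b1 ** e1 + b2 ** e2 == target
}
print(solutions)   # {((6, 8), (6, 8))} -- i.e. 6**8 + 6**8
```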

ChrisA


