python implementation of a new integer encoding algorithm.

Dave Angel davea at davea.name
Tue Feb 17 09:50:01 EST 2015


On 02/17/2015 09:34 AM, Chris Angelico wrote:
> On Wed, Feb 18, 2015 at 1:12 AM, Dave Angel <davea at davea.name> wrote:
>> They had a field type called a "compressed integer."  It could vary between
>> one byte and I think about six.  And the theory was that it took less space
>> than the equivalent format of fixed size integers.
>
> Oh, incidentally: If you want a decent binary format for
> variable-sized integer, check out the MIDI spec. Its varlen integer is
> simple: each byte carries seven bits of payload, and the high bit is
> set on all except the last byte. Anything under 128 is represented by
> the byte with that value; after that, you get a series of continuation
> bytes followed by a final byte. It could scale to infinity if you
> wanted to, though I think the MIDI spec doesn't have anything that can
> go quite that high, and it's a worst-case wastage of 12.5% (or,
> looking the other way, worst-case expansion of 14.2%), compared to an
> optimal eight-bit encoding. Given that MIDI often works with very
> small numbers, but occasionally could have to cope with much larger
> ones (eg "time to next event", which is often zero or very low, but
> once in a while could be anything at all), this is a good trade-off.
> If you know you're frequently going to work with numbers up to 2**31
> but only occasionally larger than that, it might be better to take
> four bytes at a time, and use the high bit as a continuation bit, in
> the same way. It's all trade-offs.

I have used that 7 bits/byte encoding myself for variable-length 
integers.  It worked well enough for its purpose.
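
For anyone curious, here is a minimal sketch of that 7-bits/byte scheme 
in Python (the function names are my own; it follows the MIDI convention 
quoted above, with the high bit set on every byte except the last):

def encode_varlen(n):
    """MIDI-style variable-length quantity: 7 payload bits per byte,
    most significant group first, high bit set on all but the last byte."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()                       # most significant group first
    return bytes(b | 0x80 for b in groups[:-1]) + bytes(groups[-1:])

def decode_varlen(data):
    """Decode one varlen integer from the front of `data`;
    returns (value, bytes_consumed)."""
    value = 0
    for i, b in enumerate(data):
        value = (value << 7) | (b & 0x7F)
        if not b & 0x80:                   # high bit clear marks the last byte
            return value, i + 1
    raise ValueError("truncated varlen integer")

For example, encode_varlen(128) gives b'\x81\x00'.  The 12.5% worst-case 
wastage quoted above is just the one continuation bit out of every eight.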

>
> Decimal digits give you roughly 10/3 bits per character (== per byte,
> if you encode UTF-8). If you consider that your baseline, you can then
> try to justify your algorithm on the basis of "it's twice as efficient
> as decimal". Hexadecimal gives you 4 bits per character, at a
> relatively small cost; if you can't get better than about 6 bits per
> byte average throughput, I would say your algorithm is completely
> valueless - the difference will so rarely even matter, and you lose
> all the benefits I mentioned in my last email.
>
> I've tried to read through the original algorithm description, but I'm
> not entirely sure: How many payload bits per transmitted byte does it
> actually achieve?
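
As an aside on the figures just quoted: the payload carried by one 
character of a textual encoding is simply log2 of the alphabet size.

import math

print(math.log2(10))   # ~3.32 bits per decimal digit, i.e. roughly 10/3
print(math.log2(16))   # exactly 4 bits per hex digit
print(math.log2(256))  # 8 bits per byte is the ceiling for any byte code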

Somehow when I first read the message, I missed the fact that there was 
a link to a web page with source code.

But the first thing I'd expect to see would be a target estimate of the 
anticipated distribution of number values/magnitudes.  For example, if a 
typical integer is 1500 bits, plus/minus 200 bits, I'd probably try 
encoding in base 65535, and have a terminator character of 0xffff.

Without such a target, it's not a compression scheme, but an encoding one.
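
To make the base-65535 idea above concrete, here is a rough sketch 
(16-bit digits, least significant digit first, which is an arbitrary 
choice here, with 0xFFFF reserved as the terminator):

def encode_base65535(n):
    """Encode a non-negative int as base-65535 digits, two bytes per
    digit, least significant digit first, terminated by 0xFFFF."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        n, digit = divmod(n, 65535)
        out += digit.to_bytes(2, "big")
        if n == 0:
            break
    out += (0xFFFF).to_bytes(2, "big")     # terminator
    return bytes(out)

def decode_base65535(data):
    """Decode the first such integer in `data`; returns (value, bytes_consumed)."""
    value, place, pos = 0, 1, 0
    while pos + 2 <= len(data):
        digit = int.from_bytes(data[pos:pos + 2], "big")
        pos += 2
        if digit == 0xFFFF:                # terminator
            return value, pos
        value += digit * place             # least significant digit first
        place *= 65535
    raise ValueError("missing 0xFFFF terminator")

Each 16-bit word carries log2(65535), essentially 16 bits, of payload, 
and the terminator costs a flat 16 bits, so an integer of around 1500 
bits comes out at about 95 words, i.e. a little over 1% overhead.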

An interesting point of history: at the time of Huffman's paper, it was 
provable that Huffman coding was the best possible lossless compression. 
But there were some unwritten assumptions, such as that each character 
would be encoded with a whole number of bits.  Relaxing that assumption 
makes arithmetic coding possible.  The next assumption was that the 
probabilities of each character didn't depend on context.  Relaxing that 
assumption enables LZ.  And so on.

-- 
DaveA


