python implementation of a new integer encoding algorithm.

Chris Angelico rosuav at gmail.com
Tue Feb 17 09:34:47 EST 2015


On Wed, Feb 18, 2015 at 1:12 AM, Dave Angel <davea at davea.name> wrote:
> They had a field type called a "compressed integer."  It could vary between
> one byte and I think about six.  And the theory was that it took less space
> than the equivalent format of fixed size integers.

Oh, incidentally: If you want a decent binary format for
variable-sized integer, check out the MIDI spec. Its varlen integer is
simple: each byte carries seven bits of payload, and the high bit is
set on all except the last byte. Anything under 128 is represented by
the byte with that value; after that, you get a series of continuation
bytes followed by a final byte. It could scale to infinity if you
wanted to, though I think the MIDI spec doesn't have anything that can
go quite that high, and it's a worst-case wastage of 12.5% (or,
looking the other way, worst-case expansion of 14.2%), compared to an
optimal eight-bit encoding. Given that MIDI often works with very
small numbers, but occasionally could have to cope with much larger
ones (eg "time to next event", which is often zero or very low, but
once in a while could be anything at all), this is a good trade-off.
If you know you're frequently going to work with numbers up to 2**31
but only occasionally larger than that, it might be better to take
four bytes at a time, and use the high bit as a continuation bit, in
the same way. It's all trade-offs.

Decimal digits give you roughly 10/3 bits per character (== per byte,
if you encode UTF-8). If you consider that your baseline, you can then
try to justify your algorithm on the basis of "it's twice as efficient
as decimal". Hexadecimal gives you 4 bits per character, at a
relatively small cost; if you can't get better than about 6 bits per
byte average throughput, I would say your algorithm is completely
valueless - the difference will so rarely even matter, and you lose
all the benefits I mentioned in my last email.

I've tried to read through the original algorithm description, but I'm
not entirely sure: How many payload bits per transmitted byte does it
actually achieve?

ChrisA



More information about the Python-list mailing list