key/value store optimized for disk storage

Fri May 4 02:29:28 EDT 2012

On May 3, 11:03 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
> Steve Howell <showel... at yahoo.com> writes:
> > Sounds like a useful technique.  The text snippets that I'm
> > compressing are indeed mostly English words, and 7-bit ASCII, so it
> > would be practical to use a compression library that just uses the
> > same good-enough encodings every time, so that you don't have to write
> > the encoding dictionary as part of every small payload.
>
> Zlib stays adaptive; the idea is just to start with some ready-made
> compression state that reflects the statistics of your data.
>
> > Sort of as you suggest, you could build a Huffman encoding for a
> > representative run of data, save that tree off somewhere, and then use
> > it for all your future encoding/decoding.
>
> Zlib is better than Huffman in my experience, and Python's zlib module
> already has the right entry points.  Looking at the docs,
> Compress.flush(Z_SYNC_FLUSH) is the important one.  I did something like
> this before and it was around 20 lines of code.  I don't have it around
> any more but maybe I can write something else like it sometime.
>
> > Is there a name to describe this technique?
>
> Incremental compression maybe?

Many thanks, this is getting me on the right path:

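    import zlib

    # Compress a shared prefix, then sync-flush so that everything fed in
    # so far is emitted and byte-aligned while the stream stays open.
    compressor = zlib.compressobj()
    s = compressor.compress(b"foobar")
    s += compressor.flush(zlib.Z_SYNC_FLUSH)

    # Save the prefix bytes and fork the compressor state at this point.
    s_start = s
    compressor2 = compressor.copy()

    # Branch 1: continue the original stream with "baz".
    s += compressor.compress(b"baz")
    s += compressor.flush(zlib.Z_FINISH)
    print(zlib.decompress(s))  # decompresses to "foobarbaz"

    # Branch 2: reuse the saved prefix with the copied state.
    s = s_start
    s += compressor2.compress(b"spam")
    s += compressor2.flush(zlib.Z_FINISH)
    print(zlib.decompress(s))  # decompresses to "foobarspam"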
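
Building on that, here is a rough sketch of how the copy() trick might be
wrapped up for small key/value payloads, so that the shared prefix is never
stored with each value. The names SAMPLE, pack and unpack are made up for
illustration, and the sample text is a stand-in for a representative run of
real data. It assumes both the writer and the reader prime themselves with
exactly the same sample.

```python
import zlib

# Hypothetical "representative" text; in practice this would be drawn
# from the real data set.
SAMPLE = b"the quick brown fox jumps over the lazy dog and other common english words"

# Prime a compressor with the sample and sync-flush, so its internal
# window reflects the sample and the output so far is byte-aligned.
_primer = zlib.compressobj()
PREFIX = _primer.compress(SAMPLE) + _primer.flush(zlib.Z_SYNC_FLUSH)

# Prime a decompressor with the same prefix; the output (the sample
# itself) is discarded.
_dprimer = zlib.decompressobj()
_dprimer.decompress(PREFIX)

def pack(value):
    """Compress value starting from the primed state; returns only the
    bytes that follow PREFIX in the stream."""
    c = _primer.copy()
    return c.compress(value) + c.flush(zlib.Z_FINISH)

def unpack(blob):
    """Inverse of pack(): resume from the primed decompressor state."""
    d = _dprimer.copy()
    return d.decompress(blob) + d.flush()

if __name__ == "__main__":
    value = b"the lazy dog"
    blob = pack(value)
    print(len(blob), len(zlib.compress(value)))
    print(unpack(blob))
```

The stored blob is only meaningful together with the exact same SAMPLE, so
the sample would need to be versioned or stored once alongside the data.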