zlib interface semi-broken

Scott David Daniels Scott.Daniels at Acm.Org
Wed Feb 11 04:43:43 EST 2009


Travis wrote:
> On Tue, Feb 10, 2009 at 01:36:21PM -0800, Scott David Daniels wrote:
>> .... I personally would like it and bz2 to get closer to each other...
> 
> Well, I like this idea; perhaps this is a good time to discuss the
> equivalent of some "abstract base classes", or "interfaces", for
> compression.
> 
> As I see it, the fundamental abstractions are the stream-oriented
> de/compression routines.  Given those, one should easily be able to
> implement one-shot de/compression of strings.  In fact, that is the
> way that zlib is implemented; the base functions are the
> stream-oriented ones and there is a layer on top of convenience
> functions that do one-shot compression and decompression.
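That layering is easy to see from Python itself.  Roughly, using the
existing zlib objects (a sketch, not how the module actually spells it):

import zlib

def compress_oneshot(data, level=6):
    # One-shot compression as a thin layer over the streaming object.
    c = zlib.compressobj(level)
    return c.compress(data) + c.flush()

def decompress_oneshot(blob):
    # Likewise for one-shot decompression.
    d = zlib.decompressobj()
    return d.decompress(blob) + d.flush()

The bz2 compressor side has the same shape.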

There are a couple of things here to think about.  I've wanted to
do some low-level (C-coded) searching without creating strings until
a match is found.  I've no idea how to push that down into the
library, but I may be looking for a nice low-level spot for it to fit.
The characteristics it needs are read-only access to small expanded
chunks without copying them out and, in case of a match, a (relatively
quick) way to mark points as we proceed and a (possibly slower) way to
restore from one or more marked points.
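
A rough sketch of that mark/restore idea is possible today at the
Python level with the decompressor's copy() method, though copying the
whole state is heavier than the "mark" I have in mind, and a real
version would live down in C:

import zlib

def scan_compressed(data, needle, chunk=4096):
    # Search the expansion of `data` for `needle` without keeping it all.
    # Before each input chunk we take a "mark": the input offset plus a
    # copy of the decompressor state.  A caller could restore from the
    # mark and re-expand from that point.  (Matches spanning chunk
    # boundaries are ignored to keep the sketch short.)
    d = zlib.decompressobj()
    for pos in range(0, len(data), chunk):
        mark = (pos, d.copy())
        piece = d.decompress(data[pos:pos + chunk])
        if needle in piece:
            return mark
    return None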

Also, another programmer wants to parallelize _large_ bzip file
expansion by expanding independent blocks in separate threads (we
know how to find safe start points).  To get such code to work, we
need to find big chunks of computation, and (at least optionally)
surround them with GIL release points.
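
As a consumer-side sketch of what that might look like (assuming a
hypothetical list of (start, end) ranges, and assuming each range is a
self-contained bzip2 stream, which real block boundaries are not):

import bz2
from concurrent.futures import ThreadPoolExecutor

def expand_blocks(path, block_ranges, workers=4):
    # block_ranges: assumed (start, end) byte offsets of independent,
    # self-contained bzip2 streams; finding them is the hard part.
    with open(path, 'rb') as f:
        raw = f.read()

    def expand(rng):
        start, end = rng
        return bz2.BZ2Decompressor().decompress(raw[start:end])

    # The parallel win depends on the C decompression code releasing the
    # GIL around each big chunk of work; otherwise the threads just take
    # turns.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(expand, block_ranges))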

> So what I suggest is a common framework of three APIs; a sequential
> compression/decompression API for streams, a layer (potentially
> generic) on top of those for strings/buffers, and a third API for
> file-like access.  Presumably the file-like access can be implemented
> on top of the sequential API as well.
If we have to be able to start from arbitrary points in bzip files, they
have one nasty characteristic: they are bit-serial, and we'll need to
start them at arbitrary _bit_ points (not simply byte boundaries).
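
That said, the file-like layer itself can sit thinly on top of the
sequential API.  A minimal read-only sketch (zlib shown; the shape
should be much the same for bz2):

import zlib

class DecompressedReader:
    # A sketch of read() on top of a sequential decompressor, pulling
    # compressed bytes from an underlying file object as needed.
    def __init__(self, fileobj, chunk=8192):
        self._f = fileobj
        self._d = zlib.decompressobj()
        self._buf = b''
        self._chunk = chunk
        self._eof = False

    def read(self, n=-1):
        while not self._eof and (n < 0 or len(self._buf) < n):
            raw = self._f.read(self._chunk)
            if not raw:
                self._buf += self._d.flush()
                self._eof = True
                break
            self._buf += self._d.decompress(raw)
        if n < 0:
            out, self._buf = self._buf, b''
        else:
            out, self._buf = self._buf[:n], self._buf[n:]
        return out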

One structure I have used for searching is a result iterator fed by a
source iterator: rather than doing reads with inconvenient boundaries,
the input side of the thing calls the 'next' method of the provided
source whenever it needs more data.
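
A zlib-flavored sketch of that shape:

import zlib

def inflate_iter(source):
    # Yield decompressed chunks, asking the provided source iterator for
    # the next compressed chunk whenever more input is needed; the caller
    # keeps control of where the read boundaries fall.
    d = zlib.decompressobj()
    for chunk in source:
        out = d.decompress(chunk)
        if out:
            yield out
    tail = d.flush()
    if tail:
        yield tail

Fed with something like iter(lambda: f.read(8192), b''), the consumer
never has to care where the reads land.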

> ... I would rather see a pythonic interface to the libraries than a
> simple-as-can-be wrapper around the C functions....
I'm on board with you here.

> My further suggestion is that we start with the sequential
> de/compression, since it seems like a fundamental primitive.
> De/compressing strings will be trivial, and the file-like interface is
> already described by Python.
Well, to be explicit: are we talking about decompression and compression
simultaneously, or do we want to start with one of them first?

> 2) The de/compression object has routines for reading de/compressed
> data and states such as end-of-stream or resynchronization points as
> exceptions, much like the file class can throw EOFError.  My problem
> with this is that client code has to be cognizant of the possible
> exceptions that might be thrown, and so one cannot easily add new
> exceptions should the need arise.  For example, if we add an exception
> to indicate a possible resynchronization point, client code may not
> be capable of handling it as a non-fatal exception.

Seems like we may want to say things like, "synchronization points are
to be silently ignored."
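
To illustrate the worry with a purely hypothetical API (neither the
exception nor the decompressor here exists in zlib or bz2):

class ResyncPoint(Exception):
    # Hypothetical: raised by a decompressor at a resynchronization point.
    pass

def read_all(decomp, chunks):
    # Hypothetical client loop.  If resync points arrive as exceptions,
    # every existing caller needs a clause like the one below, or the new
    # exception becomes a fatal error for it.
    pieces = []
    for chunk in chunks:
        try:
            pieces.append(decomp.decompress(chunk))
        except ResyncPoint:
            pass  # synchronization points are to be silently ignored
    return b''.join(pieces)

Making "ignore" the default inside the object, and exposing the points
some other way (a flag, an attribute, a callback), keeps old client
code working.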

--Scott David Daniels
Scott.Daniels at Acm.Org


