[Python-Dev] LZMA compression support in 3.3

Nadeem Vawda nadeem.vawda at gmail.com
Sat Aug 27 22:36:52 CEST 2011


On Sat, Aug 27, 2011 at 9:47 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 8/27/2011 9:47 AM, Nadeem Vawda wrote:
>> I'd like to propose the addition of a new module in Python 3.3. The
>> 'lzma' module will provide support for compression and decompression
>> using the LZMA algorithm, and the .xz and .lzma file formats. The
>> matter has already been discussed on the tracker
>> <http://bugs.python.org/issue6715>, where there seems to be a
>> consensus that this is a desirable feature. What are your thoughts?
>
> As I read the discussion, the idea has been more or less accepted in
> principle. However, the current patch is not and needs changes.

Please note that the code I'm talking about is not the same as the patches by
Per Øyvind Karlsen that are attached to the tracker issue. I have been working
on a completely new implementation of the module, written specifically to
address the concerns raised by Martin and Antoine.

(As for why I haven't posted my own changes yet - I'm currently an intern at
Google, and they want me to run my code by their open-source team before
releasing it into the wild. Sorry for the delay and the confusion.)


>> The proposed module's API will be very similar to that of the bz2
>> module; the only differences will be additional keyword arguments to
>> some functions, for specifying container formats and detailed
>> compressor options.
>
> I believe Antoine suggested a PEP. It should summarize the salient points in
> the long tracker discussion into a coherent exposition and flesh out the
> details implied above. (Perhaps they are already in the proposed doc
> addition.)

I talked to Antoine about this on IRC; he didn't seem to think a PEP would be
necessary. But a summary of the discussion on the tracker issue might still
be a useful thing to have, given how long it's gotten.
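
On the API question above, for concreteness, here is a rough sketch of the
sort of bz2-style interface I have in mind. Every name and keyword argument
below is provisional, not final:

    import lzma  # the proposed module

    # One-shot convenience functions, as in bz2:
    data = lzma.compress(b"hello world", format=lzma.FORMAT_XZ)
    assert lzma.decompress(data) == b"hello world"

    # Incremental (de)compression objects - the thin C layer:
    comp = lzma.LZMACompressor(format=lzma.FORMAT_ALONE)  # .lzma container
    payload = comp.compress(b"hello ") + comp.compress(b"world") + comp.flush()

    # A file interface layered on top, as with bz2.BZ2File:
    with lzma.LZMAFile("archive.xz", "wb") as f:
        f.write(b"hello world")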


>> The implementation will also be similar to bz2 - basic compressor and
>> decompressor classes written in C, with convenience functions and a file
>> interface implemented on top of those in Python.
>
> I would follow Martin's suggestions, including doing all i/o with the io
> module and the following:
> "So I would propose that a very thin C layer is created around the C
> library that focuses on the actual algorithms, and that any higher
> layers (in particular file formats) are done in Python."
>
> If we minimize the C code we add and maximize what is done in Python, that
> would maximize the ease of porting to other implementations. This would
> conform to the spirit of PEP 399.

As stated in my earlier response to Martin, I intend to do this. Aside from
I/O, though, there's not much that _can_ be done in Python - the rest is
basically just providing a thin wrapper for the C library.
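
As a sketch of how that split might look, the file interface could be a
pure-Python io.BufferedIOBase subclass that feeds raw chunks of input to the
C-level decompressor object (all names here are provisional):

    import io

    class LZMAFile(io.BufferedIOBase):
        """Sketch: read-only file object over a C-level decompressor."""

        def __init__(self, filename):
            self._fp = open(filename, "rb")
            self._decomp = LZMADecompressor()  # the assumed thin C wrapper
            self._buffer = b""

        def readable(self):
            return True

        def read(self, size=-1):
            # Feed raw chunks to the C decompressor until we have enough
            # decompressed data, or we hit the end of the file.
            while size < 0 or len(self._buffer) < size:
                chunk = self._fp.read(8192)
                if not chunk:
                    break
                self._buffer += self._decomp.decompress(chunk)
            if size < 0:
                size = len(self._buffer)
            data, self._buffer = self._buffer[:size], self._buffer[size:]
            return data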


On Sat, Aug 27, 2011 at 9:58 PM, Dan Stromberg <drsalists at gmail.com> wrote:
> I'd like to better understand why ctypes is (sometimes) frowned upon.
>
> Is it the brittleness?  Tendency to segfault?

The problem (as I understand it) is that ABI changes in a library will
cause code that uses it via ctypes to break without warning. With an
extension module, you'll get a compile failure if you rely on things
that change in an incompatible way. With a ctypes wrapper, you just get
incorrect answers, or segfaults.
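
As a toy illustration of that failure mode (using libm rather than a
compression library): ctypes takes your word for a function's signature, so a
missing or stale declaration silently produces garbage where an extension
module would have failed to compile:

    import ctypes, ctypes.util

    libm = ctypes.CDLL(ctypes.util.find_library("m"))

    # No signature declared: ctypes assumes int arguments and an int
    # return value, so on most platforms the call "succeeds" but the
    # answer is nonsense.
    print(libm.pow(2, 10))   # garbage, not 1024

    # Declaring the real signature fixes it - but nothing checks that
    # the declaration still matches the library's actual ABI.
    libm.pow.restype = ctypes.c_double
    libm.pow.argtypes = [ctypes.c_double, ctypes.c_double]
    print(libm.pow(2, 10))   # 1024.0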


> If yes, is there a way of making ctypes less brittle - say, by
> carefully matching it against a specific version of a .so/.dll before
> starting to make heavy use of said .so/.dll?

This might be feasible for a specific application running in a controlled
environment, but it seems impractical for something as widely used as the
stdlib. Maintaining a whitelist of acceptable library versions would be a
substantial burden, and (compatible) new versions would not work until the
whitelist was updated.
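
To make that concrete: liblzma does export lzma_version_string(), so the
check itself is easy to write, but the whitelist below is hypothetical and
would need updating for every release anyone cares about:

    import ctypes, ctypes.util

    TESTED_VERSIONS = {"5.0.0", "5.0.3"}  # hypothetical whitelist

    lib = ctypes.CDLL(ctypes.util.find_library("lzma"))
    lib.lzma_version_string.restype = ctypes.c_char_p
    version = lib.lzma_version_string().decode("ascii")
    if version not in TESTED_VERSIONS:
        raise RuntimeError("untested liblzma version: %s" % version)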


Cheers,
Nadeem

