[Python-Dev] Lack of sequential decompression in the zipfile module

Nilton Volpato nilton.volpato at gmail.com
Wed Mar 7 06:21:54 CET 2007


Hi Derek,

On 2/16/07, Derek Shockey <derek.shockey at gmail.com> wrote:
> Though I am an avid Python programmer, I've never forayed into the area of
> developing Python itself, so I'm not exactly sure how all this works.
>
> I was confused (and somewhat disturbed) to discover recently that the
> zipfile module offers only one-shot decompression of files, accessible only
> via the read() method. It is my understanding that the module will handle
> files of up to 4 GB in size, and the idea of decompressing 4 GB directly
> into memory makes me a little queasy. Other related modules (zlib, tarfile,
> gzip, bzip2) all offer sequential decompression, but this does not seem to
> be the case for zipfile (even though the underlying zlib makes it easy to
> do).

Not so easy, in fact, unless you open only one zip member file at a time.

If you open many member files concurrently, how will the file cache
work? How many seeks will you have to do if you read alternately from
one member file and then another? Do you provide a file-like
interface or just read in chunks? And if you need to open more than
one member file for writing in the same zip file, that is not
possible at all.
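
To make the seek question concrete, here is a minimal sketch of two
readers sharing one file object. It is not a real zip parser: the
"archive" is just two raw deflate streams laid back to back, and
MemberReader, raw_deflate, and the 8-byte chunk size are names and
values I made up for illustration. Each reader has to keep its own
offset and its own zlib state, and has to seek before every read
because the other reader may have moved the file pointer.

```python
import io
import zlib


def raw_deflate(data):
    """Compress data as a raw deflate stream (no zlib header or trailer)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


class MemberReader:
    """Hypothetical helper: one reader per open member, sharing one file object."""

    def __init__(self, shared_file, start, size):
        self.shared_file = shared_file
        self.pos = start                        # private offset into the shared file
        self.end = start + size                 # first byte past this member's data
        self.decomp = zlib.decompressobj(-15)   # private decompression state
        self.seeks = 0                          # how many seeks this reader issued

    @property
    def done(self):
        return self.pos >= self.end

    def read_chunk(self, n=64):
        """Feed up to n more compressed bytes to zlib and return the output."""
        # Another reader may have moved the shared file pointer, so always seek.
        self.shared_file.seek(self.pos)
        self.seeks += 1
        raw = self.shared_file.read(min(n, self.end - self.pos))
        self.pos += len(raw)
        data = self.decomp.decompress(raw)
        if self.done:
            data += self.decomp.flush()
        return data


# Two "members" stored back to back in one in-memory file.
a = b"alpha " * 500
b = b"beta " * 500
ca, cb = raw_deflate(a), raw_deflate(b)
shared = io.BytesIO(ca + cb)

ra = MemberReader(shared, 0, len(ca))
rb = MemberReader(shared, len(ca), len(cb))
out_a, out_b = bytearray(), bytearray()

# Alternate between the two members, a few compressed bytes at a time.
while not (ra.done and rb.done):
    for reader, out in ((ra, out_a), (rb, out_b)):
        if not reader.done:
            out.extend(reader.read_chunk(8))
```

Every alternating read costs one seek per reader here, which is the
overhead I am asking about. A real implementation would also need some
buffering policy to keep that cost down.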

> Since I was writing a script to work with potentially very large zipped
> files, I took it upon myself to write an extract() method for zipfile, which
> is essentially an adaption of the read() method modeled after tarfile's
> extract(). I feel that this is something that should really be provided in
> the zipfile module to make it more usable. I'm wondering if this has been
> discussed before, or if anyone has ever viewed this as a problem. I can post
> the code I wrote as a patch, though I'm not sure if my file IO handling is
> as robust as it needs to be for the stdlib. I'd appreciate any insight into
> the issue or direction on where I might proceed from here so as to fix what
> I see as a significant problem.

My Google Summer of Code project was about exactly this, and I
implemented a lot of nice features. These include: file-like
access to zip member files (which solves your problem, and also
provides a real file-like interface including .read(), .readline(),
etc.); support for BZIP2 compression; support for removing a member
file; and support for encrypting/decrypting member files.
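
To illustrate what file-like access to a member buys you, here is a
rough sketch. Note that it does not use the ziparchive API, which is
not shown in this message. It uses ZipFile.open() from the standard
library zipfile module as found in later Python releases, and
copy_member_in_chunks is a hypothetical helper name. The point is only
that a member can be consumed in bounded chunks instead of one big
read().

```python
import io
import zipfile


def copy_member_in_chunks(archive, member_name, sink, chunk_size=64 * 1024):
    """Stream one member of an open ZipFile into a writable binary sink."""
    total = 0
    # open() returns a file-like object; data is decompressed as it is read.
    with archive.open(member_name) as member:
        while True:
            chunk = member.read(chunk_size)
            if not chunk:
                break
            sink.write(chunk)
            total += len(chunk)
    return total


# Build a small in-memory archive so the example is self-contained.
payload = b"line of text\n" * 10000
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("big.txt", payload)

buffer.seek(0)
out = io.BytesIO()
with zipfile.ZipFile(buffer) as zf:
    # At most chunk_size bytes of decompressed data are requested per read.
    copied = copy_member_in_chunks(zf, "big.txt", out, chunk_size=4096)
    # The same object also supports line-oriented access.
    with zf.open("big.txt") as member:
        first_line = member.readline()
```

With a real file on disk as the sink, memory use stays near chunk_size
no matter how large the member is.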

The project is hosted at SourceForge [http://ziparchive.sf.net]. You
can take a look and try it. I'm planning to make a new and improved
release, perfecting the API and doing some code refactoring. I really
think that this improved version will be better than all other zip
libraries in every aspect, including number of implemented features,
speed/efficiency, and ease of use.

I think the time this will take me is roughly inversely
proportional to the amount of feedback (and help) I receive, since I
alone can't think of all the needs of such a library. Also, if
anyone would like to help with development, that would be great! I have
some local code I'm working on, and I can commit it to an svn branch if
anyone would like to see it or help.

Thanks,
-- Nilton

