Unzip: Memory Error

David Bolen db3l.net at gmail.com
Thu Aug 30 17:07:40 EDT 2007


David Bolen <db3l.net at gmail.com> writes:

> If you are going to read the file data incrementally from the zip file
> (which is what my other post provided) you'll prevent the huge memory
> allocations and risk of running out of resource, but would have to
> implement your own line ending support if you then needed to process
> that data in a line-by-line mode.  Not terribly hard, but more
> complicated than my prior sample which just returned raw data chunks.

Here's a small example of a ZipFile subclass (tested a bit this time)
that implements two generator methods:

read_generator      Yields raw data from the file
readline_generator  Yields "lines" from the file (per splitlines)

It also corrects my prior code posting, which didn't skip over the
file header properly (due to the variable-sized name/extra fields).
It needs Python 2.3+ for generator support (or 2.2 with a __future__
import).

Peak memory use is set "roughly" by the optional chunk parameter.
Roughly, because chunk controls how much compressed data is read at a
time, so each chunk grows in memory as it is decompressed.  The
readline generator adds further copies when it splits the data into
lines.

For your line-by-line file processing, it could be used like this:

    zipf = ZipFileGen('somefile.zip')

    g = zipf.readline_generator('somefilename.txt')
    for line in g:
        dealwithline(line)

    zipf.close()

Even if not a perfect match, it should point you further in the right
direction.
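
If you only need the raw bytes (say, to checksum a member without
extracting it), read_generator() can be used the same way.  A quick
sketch using the md5 module and the same placeholder names as above
(the smaller chunk just shows how to bound memory further):

    import md5

    zipf = ZipFileGen('somefile.zip')

    m = md5.new()
    for data in zipf.read_generator('somefilename.txt', chunk=8192):
        m.update(data)
    print m.hexdigest()

    zipf.close()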

-- David

          - - - - - - - - - - - - - - - - - - - - - - - - -

import zipfile
import zlib
import struct

class ZipFileGen(zipfile.ZipFile):

    def read_generator(self, name, chunk=65536):
        """Return a generator that yields file bytes for name incrementally.
        The optional chunk parameter controls the chunk size read from the
        underlying zip file.  For compressed files, the data length returned
        by the generator will be larger as the decompressed version of a chunk.

        Note that unlike read(), this method does not preserve the internal
        file pointer and should not be mixed with write operations.  Nor does
        it verify that the ZipFile is still opened and for reading.

        Multiple generators returned by this function are not designed to be
        used simultaneously (they do not re-seek the underlying file for
        each request."""

        zinfo = self.getinfo(name)
        compressed = (zinfo.compress_type == zipfile.ZIP_DEFLATED)
        if compressed:
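            # zip stores deflated data as a raw stream; a negative
            # wbits tells zlib not to expect a zlib header/trailer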
            dc = zlib.decompressobj(-15)

        self.fp.seek(zinfo.header_offset)

        # Skip the file header (from zipfile.ZipFile.read())
        fheader = self.fp.read(30)
        if fheader[0:4] != zipfile.stringFileHeader:
            raise zipfile.BadZipfile("Bad magic number for file header")

        fheader = struct.unpack(zipfile.structFileHeader, fheader)
        fname = self.fp.read(fheader[zipfile._FH_FILENAME_LENGTH])
        if fheader[zipfile._FH_EXTRA_FIELD_LENGTH]:
            self.fp.read(fheader[zipfile._FH_EXTRA_FIELD_LENGTH])

        # Process the file incrementally
        remain = zinfo.compress_size
        while remain:
            bytes = self.fp.read(min(remain, chunk))
            remain -= len(bytes)
            if compressed:
                bytes = dc.decompress(bytes)
            yield bytes

        if compressed:
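            # Feed the decompressor a final dummy byte to force out any
            # buffered data, then flush it (the same trick that
            # zipfile.ZipFile.read() uses internally)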
            bytes = dc.decompress('Z') + dc.flush()
            if bytes:
                yield bytes


    def readline_generator(self, name, chunk=65536):
        """Return a generator that yields lines from a file within the zip
        incrementally.  Line ending detection based on splitlines(), and
        like file.readline(), the returned line does not include the line
        ending.  Efficiency not guaranteed if used with non-textual files.

        Uses a read_generator() generator to retrieve file data incrementally,
        so it inherits the limitations of that method as well, and the
        optional chunk parameter is passed to read_generator unchanged."""

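        # Holds any trailing partial line carried over between chunks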
        partial = ''
        g = self.read_generator(name, chunk=chunk)

        for bytes in g:
            # A chunk of compressed data may decompress to nothing, so
            # skip empty blocks to keep the indexing below safe
            if not bytes:
                continue

            # Break current chunk into lines
            lines = bytes.splitlines()

            # Add any prior partial line to the first line, and reset it
            if partial:
                lines[0] = partial + lines[0]
                partial = ''

            # If the current chunk didn't happen to break on a line ending,
            # save the partial line for next time
            if bytes[-1] not in ('\n', '\r'):
                partial = lines.pop()

            # Then yield the lines we've identified so far
            for curline in lines:
                yield curline

        # Return any trailing data (if file didn't end in a line ending)
        if partial:
            yield partial
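

For quick testing, a __main__ stanza along these lines exercises both
generators ('test.zip' and 'notes.txt' are placeholders for an archive
and member of your own):

if __name__ == '__main__':
    zf = ZipFileGen('test.zip')

    # Raw mode: total up the decompressed size
    total = 0
    for data in zf.read_generator('notes.txt'):
        total += len(data)
    print 'decompressed size:', total

    # Line mode: show the first few lines, numbered
    for n, line in enumerate(zf.readline_generator('notes.txt')):
        if n >= 5:
            break
        print '%d: %s' % (n, line)

    zf.close()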


