A fast way to read last line of gzip archive ?

David Bolen db3l.net at gmail.com
Mon May 25 20:56:02 EDT 2009


"Barak, Ron" <Ron.Barak at lsi.com> writes:

> I couldn't really go with the shell utilities approach, as I have no
> say in my user environment, and thus cannot assume which binaries
> are installed on the user's machine.

I suppose if you knew your target you could just supply the external
binaries to go with your application, but I agree that would probably
be more of a pain than it's worth for the performance gain in real
world time.

> I'll try and implement your last suggestion, and see if the
> performance is acceptable to (human) users.

In terms of tuning the third option a bit, I'd play with the tracking
of the final two chunks (as mentioned in my first response), perhaps
shrinking the chunk size or only processing a smaller portion of it
for lines (assuming a reasonable line size) to minimize the final
loop.  You could also try using splitlines() on the final buffer
rather than a StringIO wrapper; that has a memory hit for the
constructed list, but splitting only a small portion of the buffer
would minimize it.
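
As a rough illustration of that last idea - just a sketch with an
assumed MAX_LINE bound on line length, not one of the scripts timed
below:

    # sketch: take the last line from just the tail of an already-read
    # buffer, assuming no line is longer than MAX_LINE bytes
    MAX_LINE = 255

    def last_line(buf, max_line=MAX_LINE):
        # splitlines(True) keeps the line endings, matching readlines()
        return buf[-max_line:].splitlines(True)[-1]

    print last_line('first line\nsecond line\nthe last one\n')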

I was curious what I could actually achieve, so here are three variants
that I came up with.

First, this just fine-tunes the chunk tracking slightly and then only
processes enough final data based on an anticipated maximum line
length (so if the final line is longer than that you'll only get the
final MAX_LINE bytes of that line).  I also found I got better
performance using a smaller 1024-byte chunk size with GzipFile.read()
than a MB - not entirely sure why, although it perhaps matches the
internal buffer size better:

    # last-chunk-2.py

    import gzip
    import sys

    CHUNK_SIZE = 1024
    MAX_LINE = 255

    in_file = gzip.open(sys.argv[1],'r')

    chunk = prior_chunk = ''
    while 1:
        prior_chunk = chunk
        # Note that CHUNK_SIZE here is in terms of decompressed data
        chunk = in_file.read(CHUNK_SIZE)
        if len(chunk) < CHUNK_SIZE:
            break

    # If the final chunk could be shorter than a full line, the last
    # line may have started in the prior chunk, so prepend it
    if len(chunk) < MAX_LINE:
        chunk = prior_chunk + chunk

    line = chunk.splitlines(True)[-1]
    print 'Last:', line


On the same test set as my last post, this reduced the last-chunk
timing from about 2.7s to about 2.3s.
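
(Each of these scripts takes the archive name as its only command-line
argument, so it's run as, e.g., "python last-chunk-2.py some-archive.gz"
- the filename here is just a placeholder.)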

Now, if you're willing to play a little looser with the gzip module,
you can gain quite a bit more.  If you directly call the internal _read()
method you can bypass some of the unnecessary processing read() does, and
go back to larger I/O chunks:

    # last-gzip.py

    import gzip
    import sys

    CHUNK_SIZE = 1024*1024
    MAX_LINE = 255

    in_file = gzip.open(sys.argv[1],'r')

    chunk = ''
    while 1:
        try:
            # Note that CHUNK_SIZE here is raw data size, not decompressed
            in_file._read(CHUNK_SIZE)
        except EOFError:
            # At EOF whatever decompressed data remains is sitting in
            # extrabuf; if it's shorter than a maximal line, keep the
            # prior chunk as well
            if in_file.extrasize < MAX_LINE:
                chunk = chunk + in_file.extrabuf
            else:
                chunk = in_file.extrabuf
            break

        # Grab the decompressed data and clear the internal buffer so
        # the next _read() starts fresh
        chunk = in_file.extrabuf
        in_file.extrabuf = ''
        in_file.extrasize = 0

    line = chunk[-MAX_LINE:].splitlines(True)[-1]
    print 'Last:', line

Note that in this case since I was able to bump up CHUNK_SIZE, I take
a slice to limit the work splitlines() has to do and the size of the
resulting list.  Using the larger CHUNK_SIZE (and it being raw size) will
use more memory, so could be tuned down if necessary.

Of course, the risk here is that you are dependent on the _read()
method, and the internal use of the extrabuf/extrasize attributes,
which is where _read() places the decompressed data.  In looking back
I'm pretty sure this code is safe at least for Python 2.4 through 3.0,
but you'd have to accept some risk in the future.

This approach got me down to 1.48s.
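
If you wanted to hedge against that risk a bit, one possibility (just
a sketch I haven't timed, with the actual loops elided) would be to
check for those internals up front and fall back to the portable
read() loop when they aren't there:

    # hypothetical guard: only take the fast path if the gzip internals
    # it relies on are actually present
    import gzip
    import sys

    in_file = gzip.open(sys.argv[1], 'r')

    use_internals = (hasattr(in_file, '_read') and
                     hasattr(in_file, 'extrabuf') and
                     hasattr(in_file, 'extrasize'))

    if use_internals:
        pass    # ... the last-gzip.py loop above ...
    else:
        pass    # ... the plain read() loop from last-chunk-2.py ...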

Then, just for the fun of it: once you're playing a little looser with
the gzip module anyway, note that it also spends time computing a CRC
of the decompressed data to compare against the value stored in the
archive.  If you don't mind skipping that check (it depends on what
you're using the line for), you can just do your own raw decompression
with the zlib module, as in the following code, although I still start
with a GzipFile() object to avoid having to rewrite the header
processing:

    # last-decompress.py

    import gzip
    import sys
    import zlib

    CHUNK_SIZE = 1024*1024
    MAX_LINE = 255

    decompress = zlib.decompressobj(-zlib.MAX_WBITS)

    in_file = gzip.open(sys.argv[1],'r')
    in_file._read_gzip_header()

    chunk = prior_chunk = ''
    while 1:
        buf = in_file.fileobj.read(CHUNK_SIZE)
        if not buf:
            break
        d_buf = decompress.decompress(buf)
        # The raw read() may return data (such as the trailer) that
        # yields no decompressed output, so only track chunks when we
        # actually got some
        if d_buf:
            prior_chunk = chunk
            chunk = d_buf

    if len(chunk) < MAX_LINE:
        chunk = prior_chunk + chunk

    line = chunk[-MAX_LINE:].splitlines(True)[-1]
    print 'Last:', line

This version got me down to 1.15s.
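
And if you did care about keeping the integrity check, you could
compute the CRC yourself as you go and compare it against the CRC32
stored in the gzip trailer (the four little-endian bytes that follow
the deflate stream).  An untested sketch of that variation:

    # last-decompress-crc.py - hypothetical variant of last-decompress.py
    # that keeps a running CRC of the decompressed data and checks it
    # against the CRC32 stored in the gzip trailer

    import gzip
    import struct
    import sys
    import zlib

    CHUNK_SIZE = 1024*1024
    MAX_LINE = 255

    decompress = zlib.decompressobj(-zlib.MAX_WBITS)

    in_file = gzip.open(sys.argv[1],'r')
    in_file._read_gzip_header()

    crc = zlib.crc32('')
    chunk = prior_chunk = ''
    while 1:
        buf = in_file.fileobj.read(CHUNK_SIZE)
        if not buf:
            break
        d_buf = decompress.decompress(buf)
        if d_buf:
            crc = zlib.crc32(d_buf, crc)
            prior_chunk = chunk
            chunk = d_buf

    # Bytes past the end of the deflate stream (the gzip trailer) end
    # up in unused_data; its first four bytes are the stored CRC32
    trailer = decompress.unused_data[:4]
    if len(trailer) == 4:
        stored_crc = struct.unpack('<I', trailer)[0]
        if (crc & 0xffffffffL) != stored_crc:
            print 'Warning: CRC mismatch'

    if len(chunk) < MAX_LINE:
        chunk = prior_chunk + chunk

    line = chunk[-MAX_LINE:].splitlines(True)[-1]
    print 'Last:', line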

So in summary, the choices when tested on my system ended up at (times
in seconds):

    last             26
    last-chunk        2.7
    last-chunk-2      2.3
    last-popen        1.7
    last-gzip         1.48
    last-decompress   1.12

So by being willing to mix in some more direct code with the GzipFile
object, I was able to beat the overhead of shelling out to the faster
utilities, while remaining in pure Python.

-- David


