A fast way to read the last line of a gzip archive?

Barak, Ron Ron.Barak at lsi.com
Mon May 25 02:43:32 EDT 2009


Thanks David: excellent suggestions!
I couldn't really go with the shell-utilities approach, as I have no say in my users' environment, and thus cannot assume which binaries are installed on a user's machine.
I'll try to implement your last suggestion and see whether the performance is acceptable to (human) users.
Bye,
Ron.

> -----Original Message-----
> From: David Bolen [mailto:db3l.net at gmail.com] 
> Sent: Monday, May 25, 2009 01:58
> To: python-list at python.org
> Subject: Re: A fast way to read the last line of a gzip archive?
> 
> "Barak, Ron" <Ron.Barak at lsi.com> writes:
> 
> > I thought maybe someone has a way to unzip just the end 
> portion of the 
> > archive (instead of the whole archive), as only the last part is 
> > needed for reading the last line.
> 
> The problem is that gzip compressed output has no reliable 
> intermediate break points that you can jump to and just start 
> decompressing without having worked through the prior data.
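> 
> To see that concretely, here's a small sketch (the filename 
> 'big.log.gz' is just a placeholder for any large archive): at 
> least with the current gzip module, GzipFile refuses to seek 
> relative to the end at all, and a forward seek is implemented 
> internally as decompress-and-discard:
> 
>     # seek-demo.py
> 
>     import gzip
>     import time
> 
>     f = gzip.open('big.log.gz', 'rb')
>     try:
>         f.seek(-100, 2)                # whence=2 (from EOF)...
>     except ValueError, e:
>         print 'Seek from end:', e      # ...is not supported
> 
>     start = time.time()
>     f.seek(50 * 1024 * 1024)   # "skipping" 50MB decompresses it all
>     print 'Forward seek took %.1fs' % (time.time() - start)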
> 
> In your specific code, using readlines() is probably not 
> ideal, as it builds the full list of all the decoded file 
> contents in memory only to let you pick the last line.  So a 
> small optimization would be to just iterate through the file 
> (directly, or by calling readline()) until you reach the last 
> line.
> 
> However, since you don't care about the bulk of the file, but 
> only need to work with the final line in Python, this is an 
> activity that could be handled more efficiently with external 
> tools, as you need not involve much interpreter time to 
> actually decompress/discard the bulk of the file.
> 
> For example, on my system, comparing these two cases:
> 
>     # last.py
> 
>     import gzip
>     import sys
> 
>     in_file = gzip.open(sys.argv[1], 'r')
>     # read (and discard) every decompressed line; after the
>     # loop, 'line' is still bound to the final one
>     for line in in_file:
>         pass
>     print 'Last:', line
> 
> 
>     # last-popen.py
> 
>     import sys
>     from subprocess import Popen, PIPE
> 
>     # Implement gzip -dc <file> | tail -1
>     gzip = Popen(['gzip', '-dc', sys.argv[1]], stdout=PIPE)
>     tail = Popen(['tail', '-1'], stdin=gzip.stdout, stdout=PIPE)
>     gzip.stdout.close()   # let gzip see SIGPIPE if tail exits first
>     line = tail.communicate()[0]
>     print 'Last:', line
> 
> With an ~80MB log file compressed to about 8MB, last.py took 
> about 26 seconds, while last-popen.py took about 1.7s.  Both 
> produced the same value in "line".  As long as you have local 
> gzip/tail binaries (such as from Cygwin or MinGW or 
> equivalent), this works fine on Windows systems too.
> 
> If you really want to keep everything in Python, then I'd 
> suggest working to optimize the "skip" portion of the task, 
> trying to decompress the bulk of the file as quickly as 
> possible.  For example, one possibility would be something like:
> 
>     # last-chunk.py
>     
>     import gzip
>     import sys
>     from cStringIO import StringIO
> 
>     in_file = gzip.open(sys.argv[1],'r')
> 
>     chunks = ['', '']
>     while 1:
>         chunk = in_file.read(1024*1024)  # decompress ~1MB at a time
>         if not chunk:
>             break
>         del chunks[0]           # slide the window: keep only the
>         chunks.append(chunk)    # two most recent chunks
> 
>     data = StringIO(''.join(chunks))  # at most ~2MB left to split
>     for line in data:
>         pass
>     print 'Last:', line
> 
> with the idea that you decode about one MB at a time, holding 
> onto the final two chunks (in case the actual final chunk 
> turns out to be smaller than one of your lines), and then only 
> split those chunks into lines.  There's probably some room for 
> tweaking the mechanism for holding onto just the last two 
> chunks, but I'm not sure it will make a major difference in 
> performance.
> 
> In the same environment as the earlier tests, the above took 
> about 2.7s.  That's still much slower than the external 
> utilities in relative terms, but in absolute terms the extra 
> second or so may not be critical if you need to stay in pure 
> Python.
> 
> -- David
> 
> 

