question on using tarfile to read a *.tar.gzip file

Tim Chase python.list at tim.thechases.com
Sun Feb 7 18:01:24 EST 2010


> Is there a way to do this, without decompressing each file to a temp
> dir?  Like is there a method using some tarfile interface adapter to
> read a compressed file?  Otherwise I'll just access each file, extract
> it,  grab the 1st and last lines and then delete the temp file.

I think you're looking for the extractfile() method of the 
TarFile object:

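   from glob import glob
   import tarfile
   for fname in glob('*.tgz'):
     print(fname)
     # tarfile.open() with 'r:gz' is the documented way to open a gzipped tar
     tf = tarfile.open(fname, 'r:gz')
     for ti in tf:
       print(' %s' % ti.name)
       f = tf.extractfile(ti) # None for members that aren't regular files
       if not f: continue
       fi = iter(f) # f may not natively support next()
       first_line = next(fi, None) # None if the member is empty
       line = first_line # so a one-line file still has a "last" line
       for line in fi: pass
       f.close()
       print("  First line: %r" % first_line)
       print("  Last line: %r" % line)
     tf.close()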

If you only want the first & last lines and don't want to scan 
the entire file (as I do with the for-loop), it's a little more 
complex.  The file-like object returned by extractfile() is 
documented as supporting seek(), so you can skip to the end and 
read backwards until you have enough lines.  I wrote a "get the 
last line of a large file using seeks from the EOF" function, 
which you can find at [1].  It should handle the odd/annoying 
edge cases: $BUFFER_SIZE holding more or less than a full line 
(reading backwards in further chunks, if needed, until it has 
one full line), a one-line file, and so on.  Hope it helps.
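
For what it's worth, the seek-from-the-end idea looks roughly like 
the sketch below.  This is not the function from [1], just a minimal 
illustration; the name last_line() and the chunk_size parameter are 
made up for the example, and it assumes a seekable file-like object 
opened in binary mode (which is what extractfile() hands back).  One 
caveat: on a gzip-compressed archive, seeking is emulated by 
decompressing, so this may not actually save much over a plain scan.

```python
import io
import os

def last_line(f, chunk_size=4096):
    """Return the last line of a seekable binary file-like object.

    Reads backwards from EOF in chunk_size pieces until a complete
    line has been buffered.  Returns b'' for an empty file; a
    trailing newline is kept if the file has one.
    """
    f.seek(0, os.SEEK_END)
    pos = f.tell()
    buf = b''
    while pos > 0:
        step = min(chunk_size, pos)
        pos -= step
        f.seek(pos)
        buf = f.read(step) + buf
        # Look for the newline that ends the previous line, skipping
        # the file's own final byte (which may itself be a newline).
        nl = buf.rfind(b'\n', 0, len(buf) - 1)
        if nl != -1:
            return buf[nl + 1:]
    # Reached the start of the file: it holds at most one line.
    return buf

# Example with an in-memory file; the object returned by
# extractfile() can be passed in the same way.
sample = io.BytesIO(b"first\nmiddle\nlast\n")
print(repr(last_line(sample, chunk_size=4)))
```

A tiny chunk_size is used in the example only to exercise the 
"line spans several chunks" case; in practice you'd leave it at a 
few KB.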

-tkc

[1]
http://mail.python.org/pipermail/python-list/2009-January/1186176.html




