How to process a very large (4Gb) tarfile from python?

Uwe Schmitt rocksportrocker at googlemail.com
Thu Jul 17 09:14:45 EDT 2008


On 17 Jul., 10:01, Terry Carroll <carr... at nospam-tjc.com> wrote:
> I am trying to do something with a very large tarfile from within
> Python, and am running into memory constraints.  The tarfile in
> question is a 4-gigabyte datafile from freedb.org,
> http://ftp.freedb.org/pub/freedb/, and has about 2.5 million members
> in it.
>
> Here's a simple toy program that just goes through and counts the
> number of members in the tarfile, printing a status message every N
> records (N=10,000 for the smaller file; N=100,000 for the larger).
>
> I'm finding that memory usage goes through the roof, simply iterating
> over the tarfile.  I'm using over 2G when I'm barely halfway through
> the file. This surprises me; I'd expect the memory associated with
> each iteration to be released at the end of the iteration; but
> something's obviously building up.
>
> On one system, this ends with a MemoryError exception.  On another
> system, it just hangs, bringing the system to its knees, to the point
> that it takes a minute or so to do simple task switching.
>
> Any suggestions to process this beast?  I suppose I could just untar
> the file, and process 2.5 million individual files, but I'm thinking
> I'd rather process it directly if that's possible.
>
> Here's the toy code.  (One explanation about the "import tarfilex as
> tarfile" statement. I'm running Activestate Python 2.5.0, and the
> tarfile.py module of that vintage was buggy, to the point that it
> couldn't read these files at all.  I brought down the most recent
> tarfile.py from http://svn.python.org/view/python/trunk/Lib/tarfile.py
> and saved it as tarfilex.py.  It works, at least until I start
> processing some very large files, anyway.)
>
> import tarfilex as tarfile
> import os, time
> SOURCEDIR = "F:/Installs/FreeDB/"
> smallfile = "freedb-update-20080601-20080708.tar" # 63M file
> smallint = 10000
> bigfile   = "freedb-complete-20080708.tar"  # 4,329M file
> bigint = 100000
>
> TARFILENAME, INTERVAL = smallfile, smallint
> # TARFILENAME, INTERVAL = bigfile, bigint
>
> def filetype(filename):
>     return os.path.splitext(filename)[1]
>
> def memusage(units="M"):
>     import win32process
>     current_process = win32process.GetCurrentProcess()
>     memory_info = win32process.GetProcessMemoryInfo(current_process)
>     bytes = 1
>     Kbytes = 1024*bytes
>     Mbytes = 1024*Kbytes
>     Gbytes = 1024*Mbytes
>     unitfactors = {'B':1, 'K':Kbytes, 'M':Mbytes, 'G':Gbytes}
>     return memory_info["WorkingSetSize"]//unitfactors[units]
>
> def opentar(filename):
>     modes = {".tar":"r", ".gz":"r:gz", ".bz2":"r:bz2"}
>     openmode = modes[filetype(filename)]
>     openedfile = tarfile.open(filename, openmode)
>     return openedfile
>
> TFPATH=SOURCEDIR+'/'+TARFILENAME
> assert os.path.exists(TFPATH)
> assert tarfile.is_tarfile(TFPATH)
> tf = opentar(TFPATH)
> count = 0
> print "%s memory: %sM count: %s (starting)" % (time.asctime(),
> memusage(), count)
> for tarinfo in tf:
>     count += 1
>     if count % INTERVAL == 0:
>         print "%s memory: %sM count: %s" % (time.asctime(), memusage(), count)
> print "%s memory: %sM count: %s (completed)" % (time.asctime(),
> memusage(), count)
>
> Results with the smaller (63M) file:
>
> Thu Jul 17 00:18:21 2008 memory: 4M count: 0 (starting)
> Thu Jul 17 00:18:23 2008 memory: 18M count: 10000
> Thu Jul 17 00:18:26 2008 memory: 32M count: 20000
> Thu Jul 17 00:18:28 2008 memory: 46M count: 30000
> Thu Jul 17 00:18:30 2008 memory: 55M count: 36128 (completed)
>
> Results with the larger (4.3G) file:
>
> Thu Jul 17 00:18:47 2008 memory: 4M count: 0 (starting)
> Thu Jul 17 00:19:40 2008 memory: 146M count: 100000
> Thu Jul 17 00:20:41 2008 memory: 289M count: 200000
> Thu Jul 17 00:21:41 2008 memory: 432M count: 300000
> Thu Jul 17 00:22:42 2008 memory: 574M count: 400000
> Thu Jul 17 00:23:47 2008 memory: 717M count: 500000
> Thu Jul 17 00:24:49 2008 memory: 860M count: 600000
> Thu Jul 17 00:25:51 2008 memory: 1002M count: 700000
> Thu Jul 17 00:26:54 2008 memory: 1145M count: 800000
> Thu Jul 17 00:27:59 2008 memory: 1288M count: 900000
> Thu Jul 17 00:29:03 2008 memory: 1430M count: 1000000
> Thu Jul 17 00:30:07 2008 memory: 1573M count: 1100000
> Thu Jul 17 00:31:11 2008 memory: 1716M count: 1200000
> Thu Jul 17 00:32:15 2008 memory: 1859M count: 1300000
> Thu Jul 17 00:33:23 2008 memory: 2001M count: 1400000
> Traceback (most recent call last):
>   File "C:\test\freedb\tardemo.py", line 40, in <module>
>     for tarinfo in tf:
>   File "C:\test\freedb\tarfilex.py", line 2406, in next
>     tarinfo = self.tarfile.next()
>   File "C:\test\freedb\tarfilex.py", line 2311, in next
>     tarinfo = self.tarinfo.fromtarfile(self)
>   File "C:\test\freedb\tarfilex.py", line 1235, in fromtarfile
>     obj = cls.frombuf(buf)
>   File "C:\test\freedb\tarfilex.py", line 1193, in frombuf
>     if chksum not in calc_chksums(buf):
>   File "C:\test\freedb\tarfilex.py", line 261, in calc_chksums
>     unsigned_chksum = 256 + sum(struct.unpack("148B", buf[:148]) + struct.unpack("356B", buf[156:512]))
> MemoryError

I had a look at tarfile.py in my current Python 2.5 installation's
lib path. The iterator caches a TarInfo object for every member it
reads in the list tf.members, which is why memory grows with each
iteration. If you only want to iterate over the archive and are not
interested in the rest of the TarFile functionality, you can set
"tf.members = []" inside your loop. This is a dirty hack!
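
Applied to your toy program, the loop would look roughly like this
(an untested sketch; tf, INTERVAL and memusage() are taken from your
code above). Be aware that clearing tf.members throws the cached
entries away, so later calls such as tf.getmember() or
tf.extractall() would no longer find the members you already
iterated past:

import time

count = 0
for tarinfo in tf:
    count += 1
    # drop the cached TarInfo objects so memory stays flat;
    # this breaks later random access to earlier members by name
    tf.members = []
    if count % INTERVAL == 0:
        print "%s memory: %sM count: %s" % (time.asctime(),
                                            memusage(), count)

Since you only count members and print status messages, losing the
cache should not matter for your use case.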

Greetings, Uwe


