GC performance with lists

John Krukoff jkrukoff at ltgc.com
Tue Sep 4 12:27:47 EDT 2007


> -----Original Message-----
> From: python-list-bounces+jkrukoff=ltgc.com at python.org [mailto:python-
> list-bounces+jkrukoff=ltgc.com at python.org] On Behalf Of jonas at mit.edu
> Sent: Tuesday, September 04, 2007 8:07 AM
> To: python-list at python.org
> Subject: GC performance with lists
> 
> While working on some python wrapping, I've run into some problems
> where the GC seems to take an unreasonable amount of time to run. The
> code below is a demonstration:
> 
> import gc
> #gc.disable()
> 
> data = []
> for i in xrange(100000):
> 
>     shortdata = []
>     for j in range(57):
>         mytuple = (j, i+1, i+2, i+3, i+4, i+5, i+6)
>         shortdata.append(mytuple)
>     data.extend(shortdata)
> 
> print len(data)
> 
> with gc disabled (the second line) the code runs in 15 seconds, with
> it enabled it runs in 2:15, or ~9x slower. I expected some gc
> overhead, but not an order of magnitude! Am I doing something
> obviously wrong in the above code?
> 
> Thanks,
>      ...Eric
> 

The only real optimization I see for this is moving the common
subexpressions (i+1, i+2, etc...) out of the inner loop, as a previous
poster suggested.

Something that looks like this, maybe:
data = []
append = data.append
for i in xrange( 100000 ):
	shortdata = ( i+1, i+2, i+3, i+4, i+5, i+6 )
	for j in range( 57 ):
		append( ( j, ) + shortdata )

That'll help a little. I just checked the docs to be sure: a collection is
triggered whenever the number of allocations minus the number of
deallocations exceeds a certain threshold (700 by default).
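You can inspect and change those thresholds yourself; something like this
(the exact numbers are just an illustration, and I'm using print() so it
also runs on newer Pythons):

```python
import gc

# The three values are the collection thresholds for generations 0, 1
# and 2; gen-0 collection fires when allocations minus deallocations
# exceeds the first value (700 by default).
print(gc.get_threshold())  # e.g. (700, 10, 10)

# Raising the gen-0 threshold makes collections far less frequent
# during an allocation-heavy loop, trading memory for speed.
gc.set_threshold(100000, 10, 10)
```

You'd want to restore the defaults afterwards with gc.set_threshold(700,
10, 10) if the rest of the program shouldn't be affected.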

You do realize just what a massive data structure you're building here, too,
right? On my box, building this consumes about 750 MB of memory. You're
doing a massive number of allocations to create it, too, approximately 40
million. So, if the gc gets called every 700 allocations, you're spending a
lot of time in the gc, and it's got a huge amount of memory to sweep each
time.

It sounds to me like you're triggering worst-case behaviour for the gc. You
should either change the threshold values, or simply disable the gc before
the loop and re-enable it after. A single gc run at the end probably won't
be as bad as all the intermediate ones, though my box has pretty severe
issues doing anything after creating this data structure, as it starts
swapping to disk.
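The disable/re-enable version would look roughly like this (shrunk to
n=1000 so it runs in a blink, and using range rather than xrange so the
sketch also works on newer Pythons):

```python
import gc

def build_data(n=1000):
    # Suspend the cyclic collector for the allocation-heavy loop;
    # plain refcounting still frees non-cyclic garbage immediately.
    gc.disable()
    try:
        data = []
        for i in range(n):
            shortdata = (i + 1, i + 2, i + 3, i + 4, i + 5, i + 6)
            for j in range(57):
                data.append((j,) + shortdata)
    finally:
        # Re-enable and pay for one collection at the end instead of
        # thousands of them along the way.
        gc.enable()
        gc.collect()
    return data

data = build_data(1000)
print(len(data))  # 57000 for n=1000
```

The try/finally guarantees the gc comes back on even if the loop raises.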

My architectural suggestion would be to refactor this as an iterator if at
all possible, so as to avoid the massive allocation burden.
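As a generator, that might look something like this, assuming the consumer
can process the tuples one at a time rather than needing the whole list in
memory:

```python
def iter_data(n=100000):
    # Yield the tuples lazily instead of materializing ~5.7 million of
    # them in one list; memory use stays roughly constant regardless of n.
    for i in range(n):
        shortdata = (i + 1, i + 2, i + 3, i + 4, i + 5, i + 6)
        for j in range(57):
            yield (j,) + shortdata

# Consume it lazily, e.g. just counting the items here:
count = sum(1 for _ in iter_data(1000))
print(count)  # 57000
```

Since nothing long-lived accumulates, the gc has almost nothing to sweep,
and the threshold question largely goes away.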

---------
John Krukoff
helot at comcast.net
