gc penalty of 30-40% when manipulating large data structures?

Chris Mellon arkanes at gmail.com
Fri Nov 16 10:59:16 EST 2007


On Nov 16, 2007 8:34 AM, Aaron Watters <aaron.watters at gmail.com> wrote:
> Poking around, I found a claim somewhere that Python's gc adds a
> 4-7% speed penalty.
>
> So since I was pretty sure I was not creating
> reference cycles in nucular I tried running the tests with garbage
> collection disabled.
>
> To my delight I found that index builds run 30-40% faster without
> gc.  This is really nice, because calling gc.collect() afterward
> shows that gc was not actually finding anything to collect.
>
> I haven't analyzed memory consumption but I suspect that should
> be significantly improved also, since the index builds construct
> some fairly large data structures with lots of references for a
> garbage collector to keep track of.
>
> Somewhere someone should mention the possibility that disabling
> gc can greatly improve performance, with no downside if you
> don't create reference cycles (the basic pattern is sketched below).
> I couldn't find anything like this on the Python site or elsewhere.
> As Paul (I think) said, this should be a FAQ.
>
> Further, maybe Python should include some sort of "backoff"
> heuristic which might go like this: If gc didn't find anything and
> memory size is stable, wait longer for the next gc cycle.  It's
> silly to have gc kicking in thousands of times in a multi-hour
> run, finding nothing every time.
>
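
The pattern Aaron describes is small enough to sketch inline; this is
only an illustration, with build_index() as a hypothetical stand-in for
nucular's bulk construction:

    import gc

    def build_all(documents):
        # The build creates no reference cycles, so the cyclic collector
        # has nothing to free; turning it off stops it from repeatedly
        # scanning the growing index.  build_index() is a placeholder
        # for the real work.
        gc.disable()
        try:
            index = build_index(documents)
        finally:
            gc.enable()  # restore normal collection for the rest of the program
        # Sanity check: a full collection should report nothing unreachable.
        leftover = gc.collect()
        assert leftover == 0, "unexpected reference cycles: %d" % leftover
        return index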

The GC has a heuristic where a collection kicks in whenever (container
allocations - deallocations) exceeds a certain threshold, which has
(sometimes quite severe) performance implications when building large
indexes. This doesn't seem to be very well known (it's come up at least
3-4 times on this list in the last 6 months), and the heuristic is
probably not a very good one. If you have ideas for improvements, you
can read about the current GC in the gc module docs (as well as in the
source) and post them on python-ideas.
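
The thresholds driving that heuristic can be inspected and raised
through the gc module, which is a gentler alternative to disabling gc
outright; a minimal sketch (the tuple shown is CPython's documented
default, and 100000 is just an arbitrary illustrative value):

    import gc

    # A generation-0 collection runs once (container allocations -
    # deallocations) exceeds the first threshold.  CPython's defaults:
    print(gc.get_threshold())   # (700, 10, 10)
    print(gc.get_count())       # how far each generation is toward its threshold

    # Raising the first threshold makes the collector kick in far less
    # often during a build that allocates millions of container objects,
    # while still catching cycles eventually.
    gc.set_threshold(100000, 10, 10)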


