Tremendous slowdown due to garbage collection

andreas.eisele at gmail.com
Sat Apr 12 12:11:08 EDT 2008


I should have been more specific about possible fixes.

> > python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'
>
> 10 loops, best of 3: 662 msec per loop
>
> > python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'
>
> 10 loops, best of 3: 15.2 sec per loop
>
> In the latter case, the garbage collector apparently consumes about
> 97% of the overall time.
>

In a related thread on http://bugs.python.org/issue2607,
Amaury Forgeot d'Arc suggested GC threshold settings that
actually solve the problem for me:

> Disabling the gc may not be a good idea in a real application; I suggest
> you to play with the gc.set_threshold function and set larger values, at
> least while building the dictionary. (700, 1000, 10) seems to yield good
> results.

> python2.5 -m timeit 'gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'
10 loops, best of 3: 658 msec per loop
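
In a real program one would of course not do this on the timeit command
line. Here is a minimal sketch of what I have in mind; the build function
is only a placeholder, and saving and restoring the old thresholds is my
own addition, not part of Amaury's suggestion:

import gc

def build_big_structure(n):
    # Stand-in for the real work: allocate many small container objects.
    return [(i,) for i in range(n)]

old_thresholds = gc.get_threshold()    # remember the defaults (700, 10, 10)
gc.set_threshold(700, 1000, 10)        # Amaury's suggested values
try:
    data = build_big_structure(2000000)
finally:
    gc.set_threshold(*old_thresholds)  # back to normal collection afterwards

The point is simply to make full collections much rarer while a large,
long-lived structure is being built, and to return to the normal settings
once it is finished.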

which made me suggest using these values as defaults, but then
Martin v. Löwis wrote:

> No, the defaults are correct for typical applications.

At that point I felt lost, and since the general wish in that thread
was to move the discussion to comp.lang.python, I brought it up here
in a modified and simplified form.

> I would suggest to configure the default behaviour of the garbage
> collector in such a way that this squared complexity is avoided
> without requiring specific knowledge and intervention by the user. Not
> being an expert in these details I would like to ask the gurus how
> this could be done.
> I hope this should be at least technically possible, whether it is
> really desirable or important for a default installation of Python
> could then be discussed once the disadvantages of such a setting would
> be apparent.

I still don't see what is so good about defaults that lead to O(N*N)
computation for an O(N) problem (full collections keep being triggered
as more and more objects are allocated, and each one traverses all the
long-lived objects created so far, so the total work grows quadratically
in N). I like Amaury's suggestion a lot, so I would like to see comments
on its disadvantages. Please don't tell me that O(N*N) is good enough;
for N > 1E7 it isn't.
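
To make the scaling concrete, here is a small self-contained script (not
from the original thread; the exact numbers will of course depend on the
machine and interpreter) that times the same allocation pattern for
growing N with the collector on and off. On an affected interpreter the
"gc on" times grow much faster than linearly, the "gc off" times roughly
linearly:

import gc
import time

def build(n):
    # The allocation pattern from the timings above: many small tuples.
    return [(i,) for i in range(n)]

def timed(n, use_gc):
    if use_gc:
        gc.enable()
    else:
        gc.disable()
    t0 = time.time()
    build(n)
    t1 = time.time()
    gc.enable()
    return t1 - t0

for n in (500000, 1000000, 2000000):
    print("N=%d  gc on: %.2fs  gc off: %.2fs" %
          (n, timed(n, True), timed(n, False)))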

About some other remarks made here:

> I think the linguists of the world should write better automated
> translation systems. Not being an expert in these details I would like
> to ask the gurus how it could be done.

I fully agree, and that is why I (as someone involved with MT) would
prefer to focus on MT rather than on memory allocation issues.
Incidentally, this is also why I normally prefer to work in Python: it
usually lets me focus on the problem instead of the tool.

> There are going to be pathological cases in any memory allocation
> scheme. The fact that you have apparently located one calls for you to
> change your approach, not for the finely-tuned well-conceived garbage
> collection scheme to be abandoned for your convenience.

I do not agree at all. Having to deal with N > 1E7 objects is IMHO not
pathological; it is just daily practice in data-oriented work (we're
talking about deriving models from Gigawords, not about toy systems).

> If you had made a definite proposal that could have been evaluated you
> request would perhaps have seemed a little more approachable.

I feel it is OK to describe a generic problem without having found
the answer yet.
(If you disagree: where are your definite proposals w.r.t. MT? ;-)
But I admit I should have brought up Amaury's definite proposal
right away.

> A question often asked ... of people who are creating
> very large data structures in Python is "Why are you doing that?"
> That is, you should consider whether some kind of database solution
> would be better.  You mention lots of dicts--it sounds like some
> balanced B-trees with disk loading on demand could be a good choice.

I do it because I have to count the frequencies of pairs of things that
appear in real data. Extremely simple stuff, only interesting because
of the size of the data set. After Amaury's hint to switch the GC off
temporarily, I can count 100M pair tokens (18M pair types) in about
15 minutes, which is very cool, and I can do it in only a few lines of
code, which is even cooler.
Any approach that touches the disk would be too slow for me, and
bringing in complicated data structures myself would distract me too
much from what I need to do. That's exactly why I use Python: it has
lots of highly fine-tuned, complicated machinery behind the scenes that
just does the right thing, so I should not have to (and could not)
reinvent it myself.
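
For the curious, here is a minimal sketch of the kind of counting loop
I mean. The file name, the whitespace tokenization and the choice of
adjacent-token pairs are made up for the example; the real code is
hardly longer:

import gc
from collections import defaultdict

def count_pairs(path):
    # Count adjacent token pairs in a whitespace-tokenized text file.
    counts = defaultdict(int)
    gc.disable()              # avoid repeated full collections while the dict grows
    try:
        f = open(path)
        try:
            for line in f:
                tokens = line.split()
                for a, b in zip(tokens, tokens[1:]):
                    counts[(a, b)] += 1
        finally:
            f.close()
    finally:
        gc.enable()           # switch collection back on once the structure is built
    return counts

# counts = count_pairs("corpus.txt")   # "corpus.txt" is a made-up file name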

Thanks for the helpful comments,
Andreas




