"Help needed - I don't understand how Python manages memory"

"Martin v. Löwis" martin at v.loewis.de
Sun Apr 20 14:51:07 EDT 2008


> In order to deal with 400 thousand texts consisting of 80 million
> words, and huge sets of corpora, I have to be careful about memory.
> I need to track every word's behavior, so there need to be as many
> word objects as there are words.
> I am really suffering from the memory problem; even 4 GB of memory
> is not enough... Only 10,000 texts can kill it in 2 minutes.
> By the way, my program has been optimized to ``del`` the objects after
> traversing, in order not to store the information in memory all the time.

It may well be that your application leaks memory; however, the
examples you have given so far don't demonstrate that. Most likely,
you are still keeping references to objects somewhere, and those
lingering references cause the leak.
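
To illustrate with a minimal sketch (the names here are made up):
``del`` only removes one name binding, and an object stays alive as
long as any other reference to it exists.

cache = []                 # e.g. a module-level list filled as a side effect

def process(text):
    words = text.split()
    cache.append(words)    # a second, easily forgotten reference
    # ... work with words ...
    del words              # unbinds only the local name; the list object
                           # survives because ``cache`` still refers to it

In a setup like this, no amount of ``del`` in the processing code
frees anything until ``cache`` itself is cleared.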

It's fairly difficult to determine the source of such a problem.
As a starting point, I recommend doing

import gc
print len(gc.get_objects())

several times while the program runs, to see how the number of
(gc-managed) objects develops. If your Python code leaks, this number
should grow continually; if it doesn't grow, you either have no memory
leak, or a leak in a C module (which would be even harder to track
down).
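
For example (a sketch; ``process_batch`` and ``batches`` are
hypothetical stand-ins for your real per-chunk processing and your
corpus), print the count once per batch and watch whether it keeps
climbing:

import gc

def process_batch(texts):
    pass                              # hypothetical stand-in for the real work

batches = [["some", "words"]] * 5     # stand-in for the corpus, split into chunks

for batch_no, batch in enumerate(batches):
    process_batch(batch)
    print "after batch", batch_no, len(gc.get_objects())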

Then, define a helper like the following

import gc
from collections import defaultdict

def classify():
    # count the live, gc-tracked objects by type
    counters = defaultdict(int)
    for o in gc.get_objects():
        counters[type(o)] += 1
    # rank the types by frequency and print the ten most common ones
    ranked = sorted(counters.items(), key=lambda item: item[1])
    for t, freq in ranked[-10:]:
        print t.__name__, freq

and call it from time to time, to see what kinds of objects get
allocated.
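
For instance (continuing the sketch above, with the same hypothetical
``batches`` and ``process_batch``):

for batch in batches:
    process_batch(batch)
    classify()    # the type whose count climbs on every report is the
                  # prime leak candidate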

Then, for the most frequent kind of object, investigate whether
any of them "should" have been deleted. If so, try to find
out a) whether the code that should have released them was actually
executed, and b) why they are still referenced (use gc.get_referrers
for that).
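
For example (a sketch; ``Word`` is a made-up class name standing for
whatever type turns out to be the most frequent), pick one instance
that should already be gone and ask the collector what still refers
to it:

import gc

suspects = [o for o in gc.get_objects()
            if type(o).__name__ == 'Word']   # 'Word' is hypothetical
if suspects:
    for referrer in gc.get_referrers(suspects[0]):
        # the frame and list created by this inspection itself will
        # show up here too; the remaining referrers are what keep the
        # object alive
        print type(referrer).__name__, repr(referrer)[:60]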
And so on.

Regards,
Martin


