Memory leak in Python

bruno at modulix onurb at xiludom.gro
Wed May 10 11:25:54 EDT 2006


diffuser78 at gmail.com wrote:
(top-post corrected)
> 
> bruno at modulix wrote:
> 
>>diffuser78 at gmail.com wrote:
>>
>>>I have a python code which is running on a huge data set. After
>>>starting the program the computer becomes unstable and gets very
>>>difficult to even open konsole to kill that process. What I am assuming
>>>is that I am running out of memory.
>>>
>>>What should I do to make sure that my code runs fine without becoming
>>>unstable. How should I address the memory leak problem, if any? I have
>>>a gig of RAM.
>>>
>>>Any help is appreciated.
>>
>>Just a hint: if you're trying to load your whole "huge data set" into
>>memory, you're in for trouble whatever the language - for example,
>>doing a 'buf = openedFile.read()' on a 100 gig file may not be a good
>>idea...
>>
>
> The amount of data I read in is actually small.

So the problem is probably elsewhere... Sorry: since you were talking
about a huge dataset, the good old "read-whole-file-in-memory"
antipattern seemed an obvious guess.
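
To illustrate the point - the filename and process() below are of
course placeholders for whatever you actually do:

# Anti-pattern: slurps the whole file into memory at once
# buf = open('huge.log').read()

def process(line):
    pass   # placeholder for the real per-line work

# Iterating on the file object reads one line at a time instead:
for line in open('huge.log'):
    process(line)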

> If you look at my algorithm above, it deals with 2000 nodes, and each
> node has a lot of attributes.
>
> When I close the program my computer becomes stable again and performs
> as usual. I checked the performance in the Performance Monitor and with
> "top": all of the physical memory is in use, and on top of that around
> half a gig of swap is also being used.
>
> Please give some helpful pointers to overcome such memory errors.

A real memory leak would cause the memory usage to keep increasing as
long as your program is running. If this is not the case, it's not a
"memory error", but a design/program error. FWIW, apps like Zope can end
up using a whole lot of memory, but there's no known memory-leak problem
AFAIK. And believe me, a Zope app can end up managing a *really huge
lot* of objects (>= many thousands).

> I revisited my code and found nothing obvious that would let this
> leak happen. How do I kill cross-references in the program?

Using weakref and/or gc might help.

FWIW, the default memory management in Python is based on
reference-counting. As long as anything keeps a reference to an object,
this object will stay alive. If you have a lot of cross-references and
2000+ big objects, you may effectively end up eating all the RAM and
more. The gc module can detect and manage some cyclic references (obj A
has a ref to obj B which has a ref to obj A). The weakref module
provides 'proxy' references that don't keep their target alive, letting
reference-counting do its job (I guess the doc will be much more
explicit than me).
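
A minimal sketch of both - the Node class and its attributes are made
up for the example:

import gc
import weakref

class Node(object):
    def __init__(self, name):
        self.name = name
        self.peer = None   # may hold a cross-reference to another Node

# A reference cycle: a.peer -> b and b.peer -> a, so the refcounts
# never drop to zero on their own...
a, b = Node('a'), Node('b')
a.peer = b
b.peer = a
del a, b
print(gc.collect())   # ...but the cycle collector can still reclaim them

# With a weakref proxy, the back-reference doesn't keep its target
# alive, so plain reference-counting is enough:
a, b = Node('a'), Node('b')
a.peer = weakref.proxy(b)
del b   # b is freed at once; touching a.peer now raises ReferenceError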

Another possible improvement could be to use the flyweight design
pattern to share memory for some attributes:

- a general (though somewhat Java-oriented) explanation:
http://www.exciton.cs.rice.edu/JavaResources/DesignPatterns/FlyweightPattern.htm

- two Python examples (the second being based on the first):
http://www.suttoncourtenay.org.uk/duncan/accu/pythonpatterns.html#flyweight
http://push.cx/2006/python-flyweights
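
A minimal sketch of the idea - the Color class is hypothetical, the
point being that equal constructor arguments yield one shared instance,
roughly as in the two Python examples above:

import weakref

class Color(object):
    # Cache of already-built instances, keyed by constructor argument.
    # A WeakValueDictionary, so the cache itself doesn't keep unused
    # flyweights alive.
    _cache = weakref.WeakValueDictionary()

    def __new__(cls, name):
        obj = cls._cache.get(name)
        if obj is None:
            obj = object.__new__(cls)
            obj.name = name
            cls._cache[name] = obj
        return obj

assert Color('red') is Color('red')   # 2000 nodes can share one 'red'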

HTH
-- 
bruno desthuilliers
python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
p in 'onurb at xiludom.gro'.split('@')])"


