Jeremy Hylton : weblog : 2003-04-10

Debugging memory leaks

Thursday, April 10, 2003

Tim and I have spent much of the last week debugging memory leaks in Zope. All the leaks were found by running unit tests or running tiny demo scripts in a loop. A few weeks ago, I also wrote a trivial benchmark for ZODB4. I just wanted to measure how quickly I could create persistent mail message objects and add them to a b-tree index. One conclusion is that functional testing of this sort is really important. None of the stress tests were hard to set up, but they found a large number of leaks and performance problems. The code base is looking much better now.

The chase has led us all over Zope and Python. The Python garbage collector needed some significant changes in the way it looked for objects with finalizers.

If the collector finds a cycle of objects that contains a finalizer, it does not collect the objects in the cycle. If there are multiple finalizers, there is no way for the language to know what order to invoke them in. It seems possible that each finalizer depends on the state of the other object, such that each must be run first. Guido made one helpful observation, though: If a cycle has one finalizer, it's always safe to run it.

The problem was that Python was using C code that was roughly equivalent to hasattr(obj, "__del__"). The danger is that the hasattr could execute an arbitrary amount of Python code, because of a getattr hook or a custom descriptor. It's not safe to run any Python code during garbage collection, because it could deallocate objects or make previously unreachable objects reachable again. We actually ran into both cases in the ZODB code that was failing.

The fix for the garbage collector ended up being quite satisfying, although it took quite a while to diagnose the problem and understand the right way to fix it. The key idea is that a finalizer (__del__ method) must always be defined by an object's class. A __del__ attribute on an instance is not treated as a finalizer. Since the class defines the finalizer, we can look in the dictionaries of the class and its base classes for the method without executing any Python code. One corner case of interest is that if you find a descriptor for __del__, then you assume the object has a finalizer without actually calling the descriptor. (In most cases, if you find a descriptor, you call it right away.)
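The lookup rule can be approximated in pure Python. This is only an illustration of the idea, not the collector's actual C code; has_finalizer and the three small classes are hypothetical names.

```python
def has_finalizer(obj):
    """Return True if any class in type(obj)'s MRO defines __del__.

    Only class dictionaries are inspected. The instance dictionary is
    ignored, and whatever is found is never called, not even a descriptor.
    """
    return any("__del__" in vars(klass) for klass in type(obj).__mro__)


class Base:
    def __del__(self):
        pass


class Derived(Base):  # inherits the finalizer from Base
    pass


class Plain:
    pass


plain = Plain()
plain.__del__ = lambda: None  # instance attribute: not a finalizer
```

With these definitions, has_finalizer(Derived()) is true because Base's dictionary contains __del__, while has_finalizer(plain) is false because only the instance carries the attribute.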

There are a bunch of handy techniques for debugging memory leaks that ought to be collected somewhere. Perhaps the most important idea is that if you run an isolated code fragment in a loop and run the garbage collector each time around the loop, the total reference count should stay the same. (An isolated fragment is one that doesn't make objects reachable from some external object, like sys.modules.) In practice, it will take a few iterations for the code to settle down. The first time it will probably import some module or do other initialization.

A debug build of Python provides valuable tools for deciding if the code fragment is leaking. The function sys.gettotalrefcount() returns the sum of all object reference counts (only in a debug build). If this number goes up, then you are leaking references at the very least. The function sys.getobjects() returns a list of all the objects in the interpreter, with the most recently allocated objects first. If the number of objects increases each time around the loop, you are leaking objects. This list provides a good way to analyze the leak. Tim wrote a routine that groups objects by type and reports how many objects of each type were created since the last call to the routine.
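A rough sketch of that kind of routine is below. It is not Tim's code. It uses gc.get_objects() instead of sys.getobjects() so that it runs on a regular build, which means it only sees objects tracked by the collector (containers and instances, not strings or ints). The name type_count_changes is hypothetical.

```python
import gc
from collections import Counter

_previous_counts = Counter()  # per-type counts seen on the previous call


def type_count_changes():
    """Return {type name: change in object count} since the last call.

    Assumption: gc.get_objects() stands in for the debug-build
    sys.getobjects(), so untracked objects are invisible here.
    """
    global _previous_counts
    gc.collect()
    current = Counter(type(o).__name__ for o in gc.get_objects())
    names = current.keys() | _previous_counts.keys()
    changes = {
        name: current[name] - _previous_counts[name]
        for name in names
        if current[name] != _previous_counts[name]
    }
    _previous_counts = current
    return changes
```

Call it once to set a baseline, run the suspect fragment, then call it again; types whose counts keep climbing from one call to the next are the ones to investigate.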

If an object leaks, it's a bug in C code somewhere. Either a C extension or some part of the Python core is missing a Py_DECREF() call. Knowing the type of the object may be enough to tell you where to look for the ref count problem.

In practice, a missing decref usually causes a group of objects to be leaked. It can be hard to figure out which object is responsible for the leak. One technique that proved helpful was to print the reference count and referrers for each object that was leaking.

import gc, sys
for o in leaky_objects:  # objects already identified as leaking
    print(hex(id(o)), type(o), sys.getrefcount(o), len(gc.get_referrers(o)))

Normally, the number of references will be one greater than the number of referrers. (There is an extra, temporary reference created when getrefcount() is called.) If the difference between refcount and number of referrers is more than one, there is an external reference to the object. That is, a reference from an object the garbage collector can't find. Some code forgot to decref this object.