Jeremy Hylton :
weblog :
April 2003
| last modified Thu Mar 17 01:11:16 2005
| |
Debugging memory leaks | permanent link |
Tim and I have spent much of the last week debugging memory leaks in Zope. All the leaks were found by running unit tests or running tiny demo scripts in a loop. A few weeks ago, I also wrote a trivial benchmark for ZODB4. I just wanted to measure how quickly I could create persistent mail message objects and add them to a b-tree index. One conclusion is that functional testing of this sort is really important. None of the stress tests were hard to set up, but they found a large number of leaks and performance problems. The code base is looking much better now.
The chase has lead us all over Zope and Python. The Python garbage collector needed some significant changes in the way it looked for objects with finalizers.
If the collector finds a cycle of objects that contains a finalizer, it does not collect the objects in the cycle. If there are multiple finalizers, there is no way for the language to know what order to invoke them in. It seems possible that each finalizer depends on the state of the other object, such that each must be run first. Guido made on helpful observation, though: If a cycle has one finalizer, it's always safe to run it.
The problem was the Python was using C code that was roughly
equivalent to hasattr(obj, "__del__"). The danger is
that the hasattr could execute an arbitrary amount of Python code,
because of a getattr hook or a custom descriptor. It's not safe to
run any Python code during garbage collection, because it could
deallocate objects or make previously unreachable objects reachable
again. We actually ran into both cases in the ZODB code that was
failing.
The fix for the garbage collector ended up being quite satisfying, although it took quite a while to diagnose the problem and understand the right way to fix it. The key idea is that a finalizer (__del__ method) must always be defined by an object's class. An __del__ attribute on an object is not treated as a finalizer. Since the class defines the finalizer, we can look in the dictionaries of base classes for the method without executing any Python code. One corner case of interest is that if you find a descriptor for __del__, then you assume the object has a finalizer without actually calling the descriptor. (In most cases, if you find a descriptor, you call it right away.)
There are a bunch of handy techniques for debugging memory leaks that ought to be collected somewhere. Perhaps the most important idea is that if you run an isolated code fragment in a loop and run the garbage collector each time around the loop, the total number of reference counts should stay the same. (An isolated fragment is one that doesn't make objects reachable from some external object, like sys.modules.) In practice, it will take a few iterations for the code to settle down. The first time it will probably import some module or do other initialization.
A debug build of Python provides valuable tools for deciding if the code fragment is leaking. The function sys.gettotalrefcount() returns the sum of all object reference counts (only in a debug build). If this number goes up, then you are leaking references at the very least. The function sys.getobjects() returns a list of all the objects in the interpreter, with the most recently allocated objects first. If the number of objects increases each time around the loop, you are leaking objects. This list provides a good way to analyze the leak. Tim wrote a routine that groups objects by type and reports how many objects of each type were created since the last call to the routine.
If an object leaks, it's a bug in C code somewhere. Either a C extension or some part of the Python core is missing a Py_DECREF() call. Knowing the type of the object may be enough to tell you where to look for the ref count problem.
In practice, a missing decref usually causes a group of objects to be leaked. It can be hard to figure out which object is responsible for the leak. One technique that proved helpful was to print the reference count and referrers for each object that was leaking.
for o in leaky_objects:
print hex(id(o)), type(o), sys.getrefcount(o), len(gc.get_referrers(o))
Normally, the number of references will be one greater then the number of referrers. (There is an extra, temporary reference created when getrefcount() is called.) If the difference between refcount and number of referrers is more than one, there is an external reference to the object. That is, a reference from an object the garbage collector can't find. Some code forgot to decref this object.
Metaclass for struct-like classes | permanent link |
I just noticed a reference to Alex Martelli's MetaBunch class. It looks like a useful helper, and would make a good addition to Demo/newmetaclasses/.
Packing a storage | permanent link |
Rys McCusker asked why every storage implementation needs to provides its own garbage collection (via the pack() method). The garbage collector is hard to implement. So it would be better to implement it once and allow all the storages to share that implementation. I didn't have a good answer when he asked, but I've spent about a week struggling with pack() and have one now.
The garbage collector implements the same policy for each storage: A pack, given a specified pack time, allows the storage to reclaim storage for revisions that are only needed to support undo of transactions before the pack time. That is, if an object revision is neither the current revision at the pack time nor referenced as a result of an undo or version operation after the pack time. An efficient implementation of this policy depends on all sorts of details of the storage implementation.
I have been working on a new pack implementation for FileStorage for the last several days. The current implementation is very complicated and seems impossible to maintain. As a result, I've resisted any substantial refactoring. But we've found several bugs recently and small patches to fix those bugs just increase the complexity. No choice but to start from scratch and try to do the simplest thing.
The pack implementation that Barry came up with for Berkeley storages is fairly simple, but can't be used for FileStorage at all. The chief difference is that Berkeley has data structures that can be updated in place. The storage can add and remove elements from a Berkeley table as needed. FileStorage is an append-only log, so it isn't practical to maintain auxiliary data structures. BerkeleyDB also allows these data structures to be moved between memory and disk transparently. In FileStorage, data structures like the index are always in memory.
Berkeley storage uses reference counting to do most of the work. The data and references for an object revision are stored in a separate table. Every transaction that uses the same data has a reference to the entry in the pickle table. (In FileStorage, there is a backpointer to the first transaction that wrote the pickle.) When a transaction is removed by pack, the reference count is decremented. (The Berkeley pack uses a mark-and-sweep phase to handle unreachable objects.)
The new FileStorage pack implementation starts by determining what data records written before the pack time are reachable. A FileStorage is pack by copying all the reachable records from the original file into a new file. Reachability is harder to determine that you might expect, because the undo and versions features create references to object revisions that otherwise unreachable. Tim helped me develop a straightforward reachability pass. It still needs revision to handle versions and to be space efficient.
Client-server access to ZODB | permanent link |
One component of ZODB is ZEO, a client-server storage layer that allows many clients to share a single database. Client-server support is a critical feature. Even if you only expect a single client during normal operation, it is useful to connect a second client for debugging.
The current ZEO system (2.0) works well enough, but it is fairly complex and difficult to extend. If I had free reign to start over, there are a bunch of interesting possibilities to consider.
RPC protocol
The current ZEO implementation uses version two of a homegrown RPC protocol. One serious problem with the current scheme is that it uses cPickle to serialize objects before sending them over the wire. cPickle is very efficient for serializing small objects. Since all the logic of serialization is in C, it's cheap to construct the pickle. But pickle does badly for large string objects -- like the objects returned from load(). The pickler creates a copy of the string. So to return a very large object from ZEO, we need to load the object into memory, then copy it, then send it out over asyncore. Sending it via asyncore will almost certainly break it up into little pieces. The client has to do the same thing on unpickling. This can lead to excess memory usage or outright failures (MemoryErrors) when fragmentation makes it impossible to allocate a big enough buffer.
Most of the ZEO protocol involves asynchronous methods. The store() method, for example, is asynchronous. This ends up complicating the storage API, because the storage must return a new serial number to the client. The asynchrony is important, because it improves performance. An interesting middle ground would be to extend the RPC layer with support for promises.
Given the existing problems and the paucity of time available to work on it, a better strategy is probably to use an existing solution for the communications layer. The two most promising candidates are probably Twisted's perspective broker and omniORB for Python.
Client cache
The client cache is critical for good ZEO performance. The cache keeps copies of recently used objects on disk, so that subsequent accesses don't require a trip to the server.
Guido developed a trace-driven simulator for the ZEO cache that lead to a number of improvements. The current cache keeps two data files. Every time an object is loaded, it is save in the current data file. When the current data file fills up, the cache deletes the other data file and starts with a new, empty current file. It's a very simple scheme (perhaps too simple), but it works well enough in practice if the cache is very large. Guido's simulator also showed that a buddy cache would work well.
Prefetching
Prefetching could improve the performance of the client cache. It's also possible to use a page-based cache system to achieve some benefits over the object-based cache we've got now.
If we can anticipate client access patterns, we can fetch objects before they are needed. The prefetched objects could be delivered in the background while the application is running. Or the server could send a group of objects in response to a request for one.
One need the Chandler project identified for its ZODB integration was the ability to exploit their server's bulk fetch API; for example, prefetching all the objects in a query's result set.
Mark Day's thesis suggests that a small amount of structural caching could be a good thing, where structural caching means that structure of application objects is exploited to inform caching. We would need to revise ZEO to allow clients to receive objects asynchronously.
How Big Are Persistent Objects? | permanent link |
Python objects use a lot of memory. I think people are often surprised at just how much memory they use. Persistent objects in Zope use even more memory than regular objects. One reason is that the objects store extra data in the C struct for the object. Another reason is that each database connection loads independent copies of all the objects.
It's important to minimize the amount of memory that persistent objects use, because we want to provide minimum extra overhead for using ZODB in Python programs.
It may be more important to minimize the memory for ghost objects (surrogates). A ghost object is loaded in memory because some other object has referred to it, but it is a ghost because it hasn't been access recently. The only reason a ghost is in memory is to preserve pointer equality, so it seems wasteful if a ghost uses a lot of memory.
The memory usage for a persistent object stacks up like this:
| Bytes | Usage |
|---|---|
| 12 | Python GC header |
| 8 | PyObject_HEAD |
| 20 | PyPersist_HEAD (data mgr, oid, serial, atime, state) |
| 8 | Pointer to instance dict |
| 8 | Pointer to weakref list |
| 144 | instance dict (holding up to 5 attrs) |
| 28 | PyStringObject for 8-byte oid |
| 28 | PyStringObject for 8-byte serial number |
| 256 | Total |
The C code from persistence.h is:
#define PyPersist_HEAD
PyObject_HEAD
PyObject *po_dm;
PyObject *po_oid;
PyObject *po_serial;
int po_atime;
enum PyPersist_State po_state;
typedef struct {
PyPersist_HEAD
} PyPersistObject;
There is other overhead, too. obmalloc will round up to an 8-byte offset for each malloc'd chunk. So the two strings actually use 32 bytes. And the GC header gets aligned with a long double, so on some platforms it will actually be 16 bytes. In the worst case, that's 12 more bytes.
A ghost is smaller because the instance dict is freed, but it still takes 96 bytes.
There are a few things we can do to make objects take less space. We'll probably work on that soon. First, we can use Python longs for oids and serial numbers instead of strings. In general, a 64-bit long takes as much space as an 8-byte string. OIDs are usually densely allocated, so they tend to be much smaller than 64 bits. A 30-bit OID will fit in a long that takes only 16 bytes. Using longs instead of strings will save 24 bytes (32 including obmalloc overhead).
Another space savings is to take the serial number out of the C struct for the object. In general, it's good to keep data out of the C struct, because a ghost still uses whatever memory was initially allocated for it. A ghost does not have a serial number. (A serial number is associated with the revision of the object that was loaded from the database when the object was unghosted.) If the serial number is stored in the instance dict instead, we will save 4 bytes in the object header. It will make using slots more complicated, but I think that's okay.
A third savings is to eliminate or shrink the po_atime slot that is currently used to provide LRU cache eviction. The po_atime and po_state slots are both ints, so they consume a total of 8 bytes. We should be able to fit the state in a single byte, and perhaps eliminate po_atime. It's hard to know how to replace the po_atime. ZODB3 uses a doubly-linked list in the C struct, but that adds 8 bytes of overhead to all objects. One possibility is to use Corbato's CLOCK algorithm, which is what Thor's adapative cache uses. (The details are different in Thor, because it is page-based, while ZODB is object-based.)
It is possible to use slots to eliminate the need for the instance dict, as well as the instance dict pointer and the weaklist pointer. On the other hand, if you use slots, you need to explicitly include _p_serial in the list. Perhaps there should be a base class for persistent classes with slots that does this automatically.
| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
| 1 | 2 | 3 | 4 | 5 | ||
| 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 | |||
| Mar May | ||||||
| Copyright © 2003 by Jeremy Hylton <jeremy@alum.mit.edu> | source |