|Jeremy Hylton : weblog : 2003-10-31|
Friday, October 31, 2003, 2:04 p.m.
How do you implement high-performance shared objects using optimistic concurrency control? This note sketches some ideas about scalable data structures for ZODB.
The catalog is a major hotspot in a lot of Zope applications. Any time content is modified, the catalog re-indexes the content. The catalog's indexes cause several problems. They are a hotspot, because they are frequently updated, and they interfere with undo, because a single transaction can get content modification mixed up with index changes.
We can relax application-level consistency by batching catalog operations. The catalog will be a little out of date, but we can improve performance a lot in return. If we can collect batched operations into per-client buckets, we can avoid conflicts and scale to many clients.
ZODB supports conflict resolution, a scheme that works a bit like field calls. If a transaction causes a write conflict, its resolver is executed at commit time. The resolver can try to compute a new state for the object that takes both changes into account. It can allow a transaction to succeed, instead of aborting and causing the client to retry.
A conflict resolution method is invoked with three copies of the object state: the state written by the current transaction, the state of the object when the transaction started, and the current state in the database. The first two states can be used to compute a logical diff; what change was the current transaction trying to make? Given the current state in the database, can the same change be made to it? If so, then conflict resolution can succeed. (Jim Fulton reminded me of how helpful it is to think about conflict resolution this way.)
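A minimal sketch of this three-state pattern, using a simple counter. In ZODB the hook is a `_p_resolveConflict(oldState, savedState, newState)` method on the persistent class; here it is written as a plain function over dict states, and the counter itself is just an illustration:

```python
# Sketch of three-state conflict resolution for a counter.
# ZODB's actual hook is Persistent._p_resolveConflict(old, saved, new);
# this standalone function shows the same logic over dict states.

def resolve_counter(old_state, saved_state, new_state):
    # old_state:   state of the object when our transaction started
    # saved_state: state currently committed in the database
    # new_state:   state our transaction tried to write
    delta = new_state["count"] - old_state["count"]   # the logical diff
    # Apply the same logical change to the committed state instead.
    return {"count": saved_state["count"] + delta}
```

If our transaction incremented the counter from 5 to 7 while another transaction committed 8, resolution yields 10: both increments survive, and neither transaction has to retry.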
A field call, described in Gray and Reuter's Transaction Processing book, is an old technique used in databases with locking. An update consists of a predicate and a transform. If the predicate is true when the transaction runs, then the transform is applied to the current state at commit time. It allows you to minimize the time you hold locks; you just need a read lock for the predicate, and the transform executes quickly. Conflict resolution is essentially the transform, but the predicate is the entire transaction up to that time.
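The predicate/transform split can be sketched like this; the bank-account example and function names are mine, not Gray and Reuter's notation:

```python
# Sketch of a field call: an update expressed as (predicate, transform).
# The account dict and helper names are illustrative assumptions.

def make_withdrawal(amount):
    # Predicate: checked under only a read lock while the transaction runs.
    predicate = lambda acct: acct["balance"] >= amount
    # Transform: applied quickly to the current state at commit time.
    transform = lambda acct: {**acct, "balance": acct["balance"] - amount}
    return predicate, transform

def commit_field_call(acct, predicate, transform):
    # At commit, re-check the predicate against the current state,
    # then apply the transform during the brief write-locked window.
    if not predicate(acct):
        raise RuntimeError("predicate no longer holds; abort")
    return transform(acct)
```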
In client-server ZODB, we chose to implement conflict resolution on the server. Normally, the server just writes database records to disk. Occasionally, it does garbage collection (pack), and needs to follow object references in that case. For conflict resolution, though, it's actually loading object state and running arbitrary application code. That's a scary prospect, but it was justified on performance grounds. No measurements have been taken, as far as I know, but it would take a long time if the server sent that data to the client and waited for it to return a new value. During that time, the database would be locked, preventing other transactions from committing.
There are three primary downsides to running conflict resolution on the server. The server is a scarce resource, so it needs to be as efficient as possible. It can be hard to guarantee that client and server are running the same version of the software, which is usually necessary for conflict resolution to work. And, worst of all, bugs in the application's conflict resolution code can crash or hang the server.
One alternative, which I don't recall from the original design work, is to raise a conflict error that includes the three object states, abort the transaction, and have the client retry. The client would just run conflict resolution and re-commit with the resolved state; it would not re-run the entire transaction. That would push the extra computation back to the client without the full cost of a retry. Since the server is often a bottleneck, it might be worth measuring.
Conflict resolution has been useful for specialized data structures like BTrees. In the case of the catalog, it may be better to design a different data structure that avoids conflicts altogether, or makes them unlikely during normal operation, including under heavy load.
The catalog queue allows catalog updates to be delayed. The idea is to put update events into a queue in the committing transaction. A different transaction periodically removes events from the queue and indexes the documents. The indexing process gets to amortize the cost of updating the index across multiple updates. If a document is updated several times in a short time period, only one index operation is needed. It trades some consistency between data and catalog for increased performance. The catalog queue objects use a complex conflict resolution scheme to allow multiple readers and writers.
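The collapsing step can be sketched simply; the `(doc_id, op)` event format is an assumption for illustration, not the queued catalog's actual event type:

```python
# Sketch of collapsing queued catalog events: if a document is updated
# several times between drains, only its last event needs indexing.
# The (doc_id, op) tuple format is an illustrative assumption.

def collapse_events(events):
    latest = {}
    for doc_id, op in events:   # later events override earlier ones
        latest[doc_id] = op
    return list(latest.items())
```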
The new idea is to implement the queue using a bunch of individual buckets and have each client write to a different bucket. The bucket would also have a flag to say whether it is being filled or drained. It starts out in the fill state. At some point, either because the bucket is too big or because too much time has elapsed, the bucket gets set to the drain state. Once in the drain state, the indexer starts processing the events in it.
This data structure would be implemented with no conflict resolution support. All the conflicts would cause the client to retry. One challenge is to prevent two different clients from filling the same bucket. You would still expect a conflict per client when a bucket changed from the fill to the drain state.
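A rough sketch of such a bucket, with no conflict resolution code at all. The size and age thresholds are assumptions; in ZODB each bucket would be a Persistent object, and two clients appending to the same bucket would surface as a ConflictError that forces one of them to retry elsewhere:

```python
# Sketch of a fill/drain bucket. Thresholds and state names are
# illustrative assumptions; in ZODB this would be a Persistent object
# and concurrent appends would raise ConflictError, forcing a retry.
import time

FILL, DRAIN = "fill", "drain"

class Bucket:
    def __init__(self, max_events=100, max_age=30.0):
        self.state = FILL
        self.events = []
        self.created = time.time()
        self.max_events = max_events
        self.max_age = max_age

    def append(self, event):
        if self.state != FILL:
            raise RuntimeError("bucket is draining; pick another")
        self.events.append(event)
        # Flip to drain when the bucket is too big or too old.
        if (len(self.events) >= self.max_events
                or time.time() - self.created >= self.max_age):
            self.state = DRAIN

    def drain(self):
        # The indexer calls this once the bucket is in the drain state.
        assert self.state == DRAIN
        events, self.events = self.events, []
        return events
```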
To avoid conflicts on buckets, you need to have more buckets than clients and a decent hash scheme for assigning clients to an arbitrary bucket -- something like Akamai's consistent hashing scheme. If a conflict does occur, the client getting the conflict needs to pick a different bucket. You could detect either read or write conflicts to make the decision.
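A sketch of that assignment using a simple hash ring. The md5-based placement is a stand-in for a real consistent-hashing implementation; the `skip` parameter shows how a client could advance to the next bucket on the ring after a conflict:

```python
# Sketch of assigning clients to buckets on a hash ring.
# md5 placement and the skip-on-conflict idea are illustrative
# assumptions, not Akamai's actual scheme.
import bisect
import hashlib

def _h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class BucketRing:
    def __init__(self, bucket_ids):
        # Place each bucket at a point on the ring, sorted by hash.
        self.ring = sorted((_h(str(b)), b) for b in bucket_ids)

    def bucket_for(self, client_id, skip=0):
        # skip > 0 lets a client move on after a ConflictError.
        keys = [h for h, _ in self.ring]
        i = bisect.bisect(keys, _h(client_id))
        return self.ring[(i + skip) % len(self.ring)][1]
```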
If the number of clients is variable, the number of buckets might need to grow over time. I'm not sure how that would be accomplished. Perhaps a client writes its unique id in the bucket. When a new client comes along, it tries some number of times to find an un-owned bucket. If it can't find one, it creates a new bucket and adds it to the queue.
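That claiming protocol might look something like this; the probe count and the `owner` field are assumptions, and a real version would have to handle the ConflictError raised when two clients claim the same bucket at once:

```python
# Sketch of a new client claiming a bucket, growing the queue when
# every probed bucket is owned. Probe count and the "owner" field
# are illustrative assumptions.
import random

def claim_bucket(buckets, client_id, probes=3):
    for _ in range(probes):
        b = random.choice(buckets)
        if b.get("owner") is None:
            b["owner"] = client_id    # would conflict if raced; retry
            return b
    # Every probe hit an owned bucket: grow the queue.
    new = {"owner": client_id, "events": []}
    buckets.append(new)
    return new
```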
One goal of the queued catalog is to collapse multiple operations on a single piece of content, so that you only re-index it once. It may be harder to achieve collapsing with this scheme, because successive requests can be served by different ZODB clients, so one document's operations could be spread across multiple buckets. Perhaps the indexer should drain multiple buckets at the same time to address this. Or maybe there should be some kind of app server affinity in the cluster configuration.
One reason this scheme could work is that the catalog has fairly weak semantics: if you update content, it will be re-cataloged some time after the update, and the application and operators can tune that delay against desired performance.
There must be some literature on these sorts of problems, but I haven't found it yet. Here are two papers that seem close but not quite right:
Maurice Herlihy, 1990. Apologizing versus asking permission: optimistic concurrency control for abstract data types. ACM Transactions on Database Systems, 15, 1 (March), 96-124.
Steven D. Gribble, Eric A. Brewer, Joseph M. Hellerstein, and David Culler. Scalable, Distributed Data Structures for Internet Service Construction. Proceedings of the Fourth Symposium on Operating Systems Design and Implementation (OSDI 2000).