|Jeremy Hylton : weblog : 2003-11-26|
Wednesday, November 26, 2003, 12:32 a.m.
This note summarizes some work on the design of a new cache for ZEO clients. The two key issues appear to be the cache replacement policy and managing persistent storage of the cache. A replacement algorithm like 2Q [Johnson] seems to match our data. Not sure about persistent storage yet. Tim Peters did a lot of the leg work tracking down relevant papers and summarizing them; this is just a recap.
The ZEO cache plays a crucial role in scaling large Zope clusters. It is much faster to read data from the ZEO cache than from the server, and the server can be heavily loaded with writes under heavy load. So we want the cache to absorb a lot of client reads.
Two areas of research relevant to this problem are buffer caches and Web proxy caches.
It would make sense to use memory for this cache, because machines have so much main memory. For Zope, it doesn't work. A Zope app server uses a lot of memory -- several hundred MBs is not unusual. We don't have memory to spare for the second-level object cache.
In ZODB a top-level object cache exists for each thread, because they each have independent views of the database. Even unmodified objects need to be shared in our system, because we're not using any virtual memory tricks. These caches count number of objects rather than size of objects, so a few large objects can use a lot of memory. It's almost certain that there a bunch of leaks in Zope, too.
A strict LRU algorithm probably isn't the best choice.
The ZEO cache is a second-level cache, which means that access to frequently used docs should hit mostly in the upper-level cache. In the cache trace data I've looked at so far, about 25% of the objects are referenced once. More than half of the references are for 17% of the objects.
The objects are of varying size -- from a few tens of bytes to many MBs. Most of the objects are small: For one customer database, 50% of the objects are 384 bytes or smaller and 89% are 2K or smaller.
One complication is that other clients are modifying objects. Objects are being replaced and other current objects are being removed for consistency. Most of the cache replacement algorithms seem to focus on uniprocessor main memory, where modified objects are written through to the cache. Our problem is closer to a multiprocessor cache coherence with a write update protocol, but I didn't find any papers on this subject.
The multi-queue algorithm [Chen] is for second level buffer caches on servers. The difference between client and server is that most of the frequent accesses are absorbed by caches on the client. The temporal distance between access on the server is fairly large. Larger that what we're seeing with a ZEO client cache. A later paper suggests writing objects to the second-level cache when they are evicted from the first-level cache instead of when they are loaded into it.
The simpler 2Q algorithm [Johnson] looks promising, because it will handle the many objects that a referenced once. A later paper [Lee] points out some problems with 2Q, but I haven't read it yet.
I'd also like to look at caching strategies that take size into account. The greedy dual size algorithm [Cao] for web proxy caches is the best known example. I'm also going to read the paper about the LRU-SP algorithm [Cheng].
A database is slow, and storing in individual files is slow. Instead, you want to use a single file or a small number of files and manage storage within it. If performance is a goal and portability isn't, you could also use a raw disk. Iyengar discusses the issue at length.
Tim has some ideas about how to implement this efficiently, but we're waiting for decent trace data to drive some simulations. The basic idea of log-structured file systems [Rosenblum] are relevant, because it minimizes the cost of writing new data to the cache. We don't need complicated cleaning heuristics, because we can always toss data and get it from the server later.
Pei Cao and Sandy Irani. Cost-Aware WWW Proxy Caching Algorithms Proceedings of the Usenix Symposium on Internet Technologies and Systems (USITS), 1997.
Zhifeng Chen, Yuanyuan Zhou, and Kai Li. Eviction Based Placement for Storage Caches. Proceedings of the 2003 Usenix Technical Conference.
Kai Cheng and Yahiko Kambayashi. LRU-SP: A Size-Adjusted and Popularity-Aware LRU Replacement Algorithm for Web Caching. Proceedings of the 24th IEEE Computer Software and Applications Conference, 2000.
Arun Iyengar, Shudong Jin, and Jim Challenger. Techniques for efficiently allocating persistent storage. Journal of Systems and Software, Vol. 68 (2003): 85--102.
Theodore Johnson and Dennis Shasha. 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. Proceedings of the 20th Conference on Very Large Databases (VLDB), 1994.
Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H. Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Jim. On the Existence of a Spectrum of Policies that Subsumes the Least Recently Used (LRU) and Least Frequently Used (LFU) Policies. Proceedings of ACM Sigmetrics 1999.
Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems, Vol. 10 (1992): 1, 26--52.
Yuanyuan Zhou, James F. Philbin, and Kai Li. The Multi-Queue Replacement Algorithm for Second Level Buffer Caches. Proceedings of the 2001 Usenix Technical Conference.
Ironic that a few weeks after saying I've never read anything from an Elsevier journal, I found an article published in an Elsevier journal. At least it was available online, too.