Jeremy Hylton
last modified Thu Mar 17 01:11:16 2005
XXX undetected error (why=3)
Am I the only one who sees Python complain about WHY_RETURN with an exception set? The default Python installations on my system are all debug builds. They do a lot of error checking that is turned off by default, and run much slower as a result. They also find all sorts of interesting bugs.
One of the checks is for a return from the mainloop -- eval_frame() -- with an exception pending and a non-error why code. The variable why indicates why the block stack is being unwound. Possible values are WHY_NOT (no error), WHY_EXCEPTION (exception occurred), etc.
I hit several of these errors running the Zope 3 test suite today with CVS Python. The last time I saw this error message, Fred Drake and I found a bug in pyexpat.c -- from the checkin message:
Fix several bugs in handling of exceptions with trace function enabled. If the callback raised an exception but did not set curexc_traceback, the trace function was called with PyTrace_RETURN. That is, the trace function was called with an exception set. The main loop detected the exception when the trace function returned; it complained and disabled tracing.
It should be a fatal error for the mainloop to exit with an undetected error. On several occasions, I've noticed the error after it's too late to recover what went wrong. A core dump would be better.
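The trace-function failure mode from that checkin message is easy to reproduce from the Python level. Here is a minimal sketch (bad_trace and traced are hypothetical names; the behavior of disabling tracing after a callback error is CPython's):

```python
import sys

def bad_trace(frame, event, arg):
    # A buggy trace function: raising here is the failure mode
    # described above -- the main loop detects the exception and
    # disables tracing.
    raise RuntimeError("trace callback failed")

def traced():
    return 42

sys.settrace(bad_trace)
try:
    traced()
except RuntimeError:
    # The exception propagated into the traced code, and the
    # interpreter turned tracing off on its own.
    assert sys.gettrace() is None
```

The pyexpat bug was the harder variant of this: the C code raised without setting up the exception state completely, so the complaint came from deep inside the main loop instead of at the point of the error.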
I learned a bit about Technorati tonight. I started with a Cosmos report on the Daily Python URL.
Cosmos tells you about links among weblogs. It reports when a weblog was last updated. The link search reports on links from one weblog to another. You can sort it by date -- the most recently made link first -- or you can sort it by authority. The most authoritative weblog is the one with the most links to it.
How does it decide whether something is a weblog? My first guess would be an RSS feed, but the Daily Python URL doesn't have one. It can't be registration, because there are 1,179,981 weblogs watched.
David Sifry, who created Technorati, explains how it works. Among the more interesting notes: It counts links from the weblog home page, not from the archive. So it captures things of recent interest. It spiders weblogs, but you can also submit updates. It doesn't appear to have my weblog, which may not be popular enough to be found automatically.
What's the difference between Technorati and Google? There are obvious differences, but the basic idea is similar.
David's weblog entry on the subject is filled with spam posts. I don't see the point of comments posted directly to a weblog. People can post their own comments on their own weblogs -- and software like Technorati can find them.
There is a Technorati API for programs to interact with Technorati. I haven't found an API definition yet. If you sign up as a member, you get a long Technorati key that is good for 500 queries a day.
After I registered as a Technorati user, I tried to "claim" my weblog. I got an email asking me to put a link to Technorati on my weblog home page or my blogroll. (There's a lot of jargon to learn; a blogroll is the customary set of links to other weblogs that you read.) I had to update my weblog software to add the link. I really need to add a template system, but that's a pretty big project.
ZODB Should Use More Memory
We are working on a new ZEO cache design as part of the multi-version concurrency control project for ZODB. Today Tim Peters suggested keeping the cache in memory all of the time.
In our Zope clusters, we tend to run machines with one or two gigabytes of RAM. At one customer site, we noticed that the storage server never used more than a few hundred megabytes of that memory. We had designed the new FileStorage pack implementation to be simple, then optimized it to use less memory. For that customer, at least, we should have used lots more; we could have made it fast instead.
An in-memory ZEO cache is appealing, because you don't want to waste time seeking around a file or files to keep the cache up to date. In normal operation, we expect the cache to be full; each time the client fetches a new object, we need to evict old objects to make space for the new one. If these were all in-memory operations, eviction would be cheap.
One question is whether this works for applications that have more modest system memory. My older desktop machine has 256MB of memory, but my newer desktop and laptop both have 512MB. What's typical these days? Maybe it doesn't matter, so long as the cache size is tunable.
There's still some benefit to having a persistent cache that survives across processes. If in memory is a win, you could periodically dump a snapshot to disk. Perhaps with an append-only log that would at least record invalidations so that you could quickly toss invalid stuff accumulated between the snapshot and the restart. Of course, on a clean shutdown, you can write a new snapshot.
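The snapshot-plus-log idea can be sketched in a few lines. This is a toy illustration of the design, not the actual ZEO cache; all names (SnapshotCache, the file layout) are hypothetical:

```python
import os
import pickle

class SnapshotCache:
    """Toy sketch: an in-memory cache persisted via a snapshot file
    plus an append-only invalidation log (hypothetical API)."""

    def __init__(self, snap_path, log_path):
        self.snap_path, self.log_path = snap_path, log_path
        self.data = {}  # oid (bytes) -> pickled object state
        if os.path.exists(snap_path):
            with open(snap_path, "rb") as f:
                self.data = pickle.load(f)
            if os.path.exists(log_path):
                # Replay invalidations recorded after the snapshot, so
                # stale entries are tossed quickly on restart.
                with open(log_path, "rb") as f:
                    for line in f:
                        self.data.pop(line.strip(), None)
        self._log = open(log_path, "ab")

    def load(self, oid):
        return self.data.get(oid)

    def store(self, oid, state):
        self.data[oid] = state

    def invalidate(self, oid):
        self.data.pop(oid, None)
        self._log.write(oid + b"\n")  # cheap append, no seeking
        self._log.flush()

    def snapshot(self):
        # Periodically, or on clean shutdown: dump everything and
        # truncate the invalidation log.
        with open(self.snap_path, "wb") as f:
            pickle.dump(self.data, f)
        self._log.close()
        self._log = open(self.log_path, "wb")
```

During normal operation everything is a dict lookup; the only disk traffic is the occasional snapshot and the tiny log appends for invalidations.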
Chomsky on Splendor
A coincidence. From an interview with Noam Chomsky in the Sunday Times magazine:
I have known people who are working class or craftsmen, who happen to be more intellectual than professors. If you are working 50 hours a week in a factory, you don't have time to read 10 newspapers a day and go back to declassified government archives. But such people may have far-reaching insights into the way the world works.
I saw the movie American Splendor about Harvey Pekar and his comic book American Splendor on Saturday night. Good movie! Pekar worked full-time as a file clerk at the VA hospital in Cleveland, although the movie perhaps under-emphasized his writing as a jazz critic.
Yes, Harvey has a weblog.
Python Software for Weblogs
Weblog software draws on an interesting mix of technologies -- web client and server, XML processing, database or text indexing. I'm collecting a list of Python projects in this space.
That's enough for now. The list is probably incomplete, but I've collected enough to keep me busy for a while.
Werner Vogels reports that the Cornell CS department is dropping subscriptions to all Elsevier journals. The decision comes as the university library is planning to drop several hundred Elsevier journals; Elsevier journals represent 20% of the library's serials budget but only 2% of the number of serials.
Why do people bother submitting to Elsevier journals? Articles freely available online are more highly cited. I can't recall ever reading an article from an Elsevier journal.
Vogels commented on this post. (I revised the first paragraph to clarify the 20% issue.)
I wanted to follow up on your posting about my Elsevier article; the 20% is not for Computer Science, as you suggest in your writing, but for the overall university. Mainly in Arts & Sciences, cognitive sciences, the medical school, etc., Elsevier is pretty big. Actually I think there are only a few relevant Computer Science journals published by Elsevier.
I agree with your point about online publishing, and I believe CS is moving in a good direction; the physics people decided long ago that online review was the way to go with arXiv. However, it will take a lot of work to convert the 'traditional sciences' to accept a new model, especially in areas where complete careers depend on a few publications in these journals. We need to think about other acceptable metrics there, or maybe overhaul the reward system for journal publications (which would suit me well, as I think it is terribly outdated).
Back from LL3
Just back from a short trip to Boston for LL3 and the Scheme workshop. Jay McCarthy has already posted detailed notes about each talk (also available as OPML). You can also watch the streaming video.
Clean is a pure, lazy functional language. That puts it in the same company as Haskell, but the type system is different. Might be interesting.
Clean's key feature is a polymorphic uniqueness type inferencing system, a special extension of the Milner / Hindley / Mycroft type inferencing/checking system allowing a refined control over the single threaded use of objects; with this uniqueness type system one can influence the time and space behavior of programs; it can be used to incorporate destructive updates of objects within a pure functional framework, it allows destructive transformation of state information, it enables efficient interfacing to the non-functional world (to C but also to I/O systems like X-Windows) offering direct access to file systems and operating systems;
Another List of the 100 Greatest Novels
Just saw a link to the latest list of the 100 greatest novels. It is unusual for listing so many early novels at the top -- Don Quixote, Pilgrim's Progress, Robinson Crusoe. It's also unusual for only having one novel by an author; that keeps Jane Austen and Charles Dickens from crowding everyone else out.
Edith Grossman's new translation of Don Quixote has been getting very good reviews. I looked for it last weekend for my train ride, but couldn't find a copy.
I've read 18 of the 100 books from The Charterhouse of Parma, the highest ranked book I've read, through Atonement, the book on the list I've read most recently. I enjoyed Parma, but I don't think I would rank it above Huck Finn or Anna Karenina.
Brian Reynolds kindly explained my confusion about the ranking of the list. It is a chronological list rather than a best-first list. Of course Don Quixote is first! (I got a copy for Christmas and it's next on my reading list.)
2003 Scheme Workshop
My overall impression of the workshop is that the presentations were interesting and well-done, but didn't have a lot of sizzle. Scheme has been around for a long time (closing in on 30 years) and is very well understood in theory and in practice. The talks seemed to be exploring several small corners of the Scheme world.
The workshop ended with a group discussion of Scheme standardization. The plan presented and largely agreed upon there was to appoint editors for a new Scheme standard within two weeks.
I was in Boston for the LL3 workshop on Saturday. We planned to have the two events on back-to-back days to allow people to attend both, and it was convenient for me at least. I wouldn't have gone to a Scheme workshop otherwise.
Richard Kelsey gave an excellent talk on the interaction between threads and dynamic-wind; basically, if you want to use continuations for threads, you can't use call/cc because it invokes the dynamic-wind thunks. One conclusion from the paper: "It is better to build first-class continuations and dynamic-wind on top of a native thread system rather than building the thread system on top of continuations." This didn't come through as directly in the talk; too bad there wasn't any debate of it.
Greg Cooper's demo of FrTime (pronounced "father time") was the highlight of the day. He has implemented a functional reactive programming system for PLT Scheme. It allows values in the language to vary with time (as opposed to having an elaborate structure for polling for new values from some external source). When he showed an interactive interpreter prompt where the value of current-seconds updated itself every second in the history, he got a spontaneous round of applause from the audience.
There were a dozen talks, but I don't have time to write up my notes on all of them. Here are a few of the highlights.
Daniel Silva presented his work on embedding Python in Scheme. The Python source is translated into Scheme source. It's just a proof of concept at the moment, running many times slower than the Python interpreter. Most of Python is too dynamic to translate directly into similar features in Scheme. MzScheme has modules and classes, but Python's modules and classes are implemented using namespaces and hash tables.
One concern for the project is integration with the runtime system provided by the Python interpreter. A goal here was to make Python C extensions available in the Scheme environment, but Python's use of direct access to the ob_type field defined by PyObject_HEAD is a major stumbling block.
Matthias Felleisen, who advises the students, told me he was disappointed that their static debugger, MrFlow, did not work well with Python. The value flow analysis that MrFlow performs does not cope well with Python's very dynamic style. Since a class's or instance's __dict__ can be manipulated directly, the flow analysis can't infer much beyond access to the underlying hash table. Typical Python coding style favors an imperative rather than functional style, where objects are modified in place and methods don't return anything. Felleisen didn't like that either.
Python can be very dynamic, but in practice changes to classes and objects at runtime via __dict__ are fairly rare. It's too bad you can't partition the analysis into two parts -- one ignoring runtime introspection and another that attempts to identify the risks of that introspection. For example, ZODB does a lot of dynamic modification of objects, but only to migrate object state in and out of the database.
I was a little surprised that the focus was on close integration with the existing Python interpreter. I would have guessed that you could translate Python to pure Scheme and then compile the code to get good performance.
Danny Dubé gave the PICBIT talk -- a Scheme implementation for the PIC microcontroller. The chief challenge was to get programs running with less than 2KB of RAM.
It was a good nuts-and-bolts talk. How did they do it? A compact object representation, with a bunch of ints pre-allocated in ROM. They use the Deutsch-Schorr-Waite mark-and-sweep collector. Stop-the-world isn't too costly when the world is only 2KB.
They showed some space results for a bunch of small programs, including a 600-line Earley parser and an 800-line Scheme interpreter. ROM size ranged from 20 to 35KB. RAM size was a little over 2KB for Earley, but other things fit.
Peter Walton Hopkins presented an extension to the basic web-UIs-as-continuations work for the PLT Scheme web server. The framework provides a single continuation for each page. This becomes hard to use when a page has many links; you need to dispatch to the code that handles the particular action performed via the continuation URL.
send/suspend/dispatch gives you a continuation plus a closure, which makes it easier to manage the dispatch code. I'm not terribly familiar with the original send/suspend framework, but the talk was interesting enough that I'd like to investigate.
Programming the Web with High-Level Programming Languages ( PDF), from the 2001 European Symposium on Programming, was cited, as was Queinnec's work on continuations and web servers. He also had a recent article in SIGPLAN Notices.
This talk was interesting, but it was given while I was dragging after lunch. Definitely worth reading the paper. During the discussion, Shriram Krishnamurthi asked two good questions. First, aren't you just recreating BNF? The same question had occurred to me, but I didn't catch any of Matthias's answer beyond "No." The other question was about the complexity of checking, given experience with XML transformers. Here Matthias explained that XML was big, but macros were small.
The new plan is to have a steering committee (of gray beards) and a small group of editors to draft a new standard. The session was run by Alan Bawden; he and Will Clinger did most of the talking.
The existing standard series, ending with R5RS, was composed by a group known as the "authors group." It seems this was basically whoever happened to come to an early meeting of people interested in Scheme. It includes people who aren't actively working on Scheme anymore and excludes many people who are.
The biggest problem with the authors group was that decisions had to be unanimous, which meant that a small minority could block any change. As a result, R5RS does not address a lot of important issues. I've never written large Scheme programs (outside coursework), but it seems that the lack of a module system is a key issue. Perhaps because there is no module system, there don't seem to be a lot of reusable modules.
I checked later and was surprised to see just how many Scheme implementations have enough life to get mentioned in the FAQ. There are perhaps a dozen implementations I've heard of; the full list is in the FAQ. (And this doesn't count all the people who did a little one as a hack, like Cotton Seed's implementation in Display Postscript for the NeXT.)
Why do they think they'll make progress? In the past, they've been stuck on disagreements over basic design issues. Either people are going to have to agree to go along with individual decisions they don't like, or the people who had prevented progress are no longer part of the process. Regardless, the standard will only work if people implement it and write code against it. Best of luck.
Object databases have been around for a long time. There aren't a lot of new ideas in ZODB; it's just a good collection of old ideas. I should look more closely at other current work in this area. There is some interesting open source work.
ZODB is remarkably similar to ObjectStore PSE for Java, although ZODB is distributed and the PSE is not. PSE is not open source, so there's little to look at.
GOODS is a generic object database with bindings for C++, Java, Perl, and Smalltalk.
Ozone is an object database for Java.
Spambayes False Positive
An email I sent Tim was scored as 100% spam by his Spambayes filter. The email contained lots of charts and text graphics summarizing the contents of a Zope database. Most of the numbers were spam clues.
The message did contain a few good ham clues -- like my email address and words like "class" -- but an overwhelming number of spam clues. Many of the three-digit numbers in the report were mild spam indicators -- seen in three spam and one ham, or two spam and no ham. Tim noted that IP addresses often show up in URLs in spam, so the individual components are probably in the training data.
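The arithmetic behind a "mild spam indicator" can be sketched with the Robinson-style smoothing that SpamBayes uses for per-token probabilities (a simplified sketch, not the actual SpamBayes code; corpus sizes here are made up):

```python
def token_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed probability that a message containing this token is spam.
    spam_count/ham_count: messages containing the token in each corpus;
    nspam/nham: total messages trained in each corpus."""
    spamratio = spam_count / nspam
    hamratio = ham_count / nham
    prob = spamratio / (spamratio + hamratio)
    # s is the strength of the neutral prior x = 0.5: tokens with
    # little evidence are pulled toward 0.5 rather than trusted fully.
    n = spam_count + ham_count
    return (s * x + n * prob) / (s + n)

# A token seen in three spam and one ham (equal-sized corpora)
# comes out a bit above neutral -- a mild spam indicator:
p = token_spamprob(3, 1, 1000, 1000)
```

With hundreds of such mildly spammy number tokens and only a handful of strong ham clues, the combined score can easily hit 100% spam.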
He trained on one of the messages and the next report email came up as 100% ham.
Grouping Bibliographic Records
Jon Udell points to a D-Lib article about grouping bibliographic records into works and expressions. The problem is that you search for Hamlet and get many results back -- different editions, different movies, a copy within a collected works, etc. They all show up as different entries even though at some abstract level they are the same. How can we make effective search interfaces that exploit those relationships?
This was the subject of my master's thesis, Identifying and Merging Related Bibliographic Records. I looked at a narrower problem: identifying duplicates and related items in computer science publications.
The article by Thom Hickey, chief scientist at OCLC, and Edward O'Neill and Jenny Toves reports on work undertaken in response to the Functional Requirements for Bibliographic Records report. The article says:
The most innovative part of the report dealt with the first group of entities, describing the hierarchical relationships that cluster bibliographic items into manifestations, expressions and works.
At the end of the article, they note: It's implemented in Python and uses Twisted. Good for them! My thesis work was in Perl, the first and last substantial project I wrote in Perl.
They have a fairly straightforward approach that relies on high-quality bibliographic records and authority files. Step one in creating a "work set" is to use the normalized primary author and title. I was working with records in BibTeX and RFC 1357 formats, where errors in author and title were routine. In the absence of record or title authorities, I used n-grams to find approximate matches. One of my conclusions was:
Using authority control to regularize the use of certain fields, notably author, journal, and publisher, would improve the quality of the records visible to the user and, in the case of the author field, the quality of the clustering algorithm. (Recall that problems parsing and comparing author lists accounted for most of the missed matches during clustering.) A system for authority control, however, would have to deal with some of the same problems the clustering algorithm handles now; it needs to identify as many variant entries as possible without being so aggressive that truly different entries are conflated.
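The approximate matching can be sketched with a character n-gram similarity (an illustrative reconstruction in Python, not the thesis's actual Perl code):

```python
def ngrams(s, n=3):
    # Normalize whitespace and case, then take the set of
    # overlapping character n-grams.
    s = " ".join(s.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b, n=3):
    # Dice coefficient over character trigrams: robust to small
    # typos and formatting differences between records.
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Two title strings that differ only in punctuation or a typo score close to 1.0, while unrelated titles score near 0, so a threshold on this score finds candidate matches without exact normalization.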
Hickey's article describes a much nicer user interface than I actually implemented. In particular, they address how to compare work sets.
A Small Collection of XML Links
A handful of articles about XML processing in Python.
Python at LinuxWorld Expo
Steve Holden and I are speaking at the LinuxWorld Expo in January at the Javits Center in New York City. Steve is giving his popular Network Programming in Python course. I am talking about Programming Weblogs with Python.
ZEO Cache Effectiveness
I just wanted to confirm that loading objects from the ZEO cache is faster than loading them from the ZEO server. With a fast network, you might think we could read data from the ZEO server as fast as we could read it off disk. Not even close, though: the cache looks to be 40x faster.
We have probably made a lot of design decisions that favor simplicity over performance for the ZEO server. If an object is sitting in memory on the server, couldn't the server send it to the client faster than the client could read it from disk? The protocol is one problem. The data gets pickled as part of the encoding process, then it is chunked up by asyncore. As a result, we make several copies of the data as it is sent. The receiver has to do the same. Neither cPickle nor asyncore is that fast compared to copying bytes over the network.
I did a simple test with a 1.5GB Data.fs, which means that the whole file can't sit in the machine's buffer cache. (It's only got 1GB of RAM.) I chose 250 random oids and loaded each object 11 times. I compared the median times of cached lookup and a direct zeoLoad() call to the storage server. Note that zeoLoad has a counter-productive optimization for locked versions, causing it to read the object's header twice. It will hit in the cache the second time, but the struct.unpack() call is probably still slow.
For the 250 oids, the cache lookup is about 40 times faster. A read from the cache takes about 80 microseconds. A read from the server takes about 3 milliseconds. The server was completely idle when I made these measurements. If the server was heavily loaded, reads would take longer.
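The shape of the measurement is simple enough to sketch. This is a hypothetical harness in the spirit of the test described above, not the script that produced the numbers:

```python
import time
import statistics

def median_load_times(load, oids, repeat=11):
    """Time `load(oid)` `repeat` times per oid and keep the per-oid
    median.  `load` stands in for either a cache lookup or a
    zeoLoad() call to the storage server."""
    medians = {}
    for oid in oids:
        samples = []
        for _ in range(repeat):
            t0 = time.perf_counter()
            load(oid)
            samples.append(time.perf_counter() - t0)
        medians[oid] = statistics.median(samples)
    return medians
```

Comparing medians rather than means damps the effect of the occasional slow outlier, such as the first read of an object that isn't yet in the operating system's buffer cache.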
Proofs, Types, and Java Integration
The New Jersey Programming Languages Seminar met at Penn last Friday to hear five talks. The highlight was Andrew Appel's talk on proof-carrying code. I should have taken notes, because the slides don't seem to be available.
The basic notion of proof-carrying code is that a trusted system could run untrusted code if the untrusted code provided a proof that it respected the necessary safety policy. While it may be hard to construct a proof, it is straightforward to check its correctness. Necula and Lee first presented that idea at OSDI '96 in Seattle. The trusted computing base (TCB) of a PCC system includes the code to verify the proof is correct.
The project Appel described is heroic. A verification system must map between machine instructions and their safety effects. His group is constructing a large, machine-checkable proof of a simple memory safety property for the Sparc processor. If I recall correctly, the largest part of the proof (more than 100,000 lines) is a proof that their typed assembly language is correct. That proof can be mechanically checked and thus omitted from the TCB. I've probably garbled the details, but the end result is a TCB of a few thousand lines of code -- small enough that it might be checked by hand.
David Sands talked about Controlled Downgrading based on Intransitive (Non)Interference. Like Appel's talk, it was an excellent presentation of the material. In a security system based on information flow, downgrading is necessary to make the system usable. Sands presented a language where downgrade operations were marked with brackets, allowing you to identify what is downgraded and how much information is leaked as a result. The type system guarantees that you only leak information that is explicitly identified as downgraded. The notion of "intransitive non-interference" comes from Roscoe and Goldsmith.
Mukund Raghavachari gave a talk about XJ, a language from IBM Research that integrates Java and XML. It was a good audience for the talk, because Benjamin Pierce's group at Penn also works in this area (Xtatic and XDuce). Two parts of XJ are making XML elements into Java objects based on a schema and integrating XPath expressions into the Java source. It wasn't really an XML binding, as the XML objects did not have data attributes; to access a child element, you needed to use an XPath expression.
The other two talks were on specifying high-level communications patterns and on inferring lock types to avoid races in Java programs. (My summaries are shorter at least in part because I've about used up the time I have to write them.) The communications patterns talk used an ATM machine as its example. The correspondence assertion allows you to say that money deposited at the ATM actually reaches the bank, where the user, ATM, and bank are represented as separate processes. I wonder how the system handles failures; you really want to say that the money winds up at the bank or back in your hands, but I'm not sure how the technique handles that.
Boyapati and Rinard developed ownership types, which have been used to prevent data races and deadlocks in multi-threaded programs and to reason about upgrades in persistent object stores. Rahul Agarwal talked about a system to infer the type annotations by tracing test executions of the un-annotated program. A nice feature of this approach is that you don't need to have great coverage in the test executions: it infers types that are then checked statically. You still won't get annotations for code you didn't execute, though; if that code was significant, I assume the annotated program would not compile. At lunch, Rahul said the technique did not extend to locking disciplines like condition variables. For condition variables, it was too hard to tell which wait would be resumed as the result of a notify.
I had lunch with a group of people from Penn and Maryland. Jeff Foster from Maryland told me a little about CQUAL, a tool for analyzing type annotations for C programs. One of Pierce's students (sorry, I forgot your name) told me a little about Harmony, a project studying synchronizers for tree-structured data (like XML).
ZODB Serial Numbers
In the new multi-version concurrency control branch, the serial number seems to be growing irrelevant. Transaction ids are used directly in all the new code. Unfortunately, serial numbers are still necessary to support versions -- in particular, aborting a version.
The difference between a serial number and a transaction id only matters in a very small number of cases. Normally the serial number for each object in a transaction is the transaction's id. If the transaction is an abort version, then the objects get serial numbers that correspond to the serial number of the non-version data from an earlier transaction. One reason for this complication is to handle the original ZEO cache verification algorithm, which worked on serial numbers and didn't want to invalidate its current non-version data just because a version was aborted. The other reason is that a transaction that commits using the non-version data does not conflict with the abort version operation.
The only other place serial numbers are used is to detect conflicts. When a transaction stores a new revision of an object, it passes the previous serial number to the storage. If that previous serial number doesn't match the current serial number, a conflict has occurred -- another client has committed its changes first.
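That conflict check is easy to sketch. This is a minimal toy, not ZODB's actual storage interface; the names ToyStorage and ConflictError are illustrative:

```python
class ConflictError(Exception):
    pass

class ToyStorage:
    """Minimal sketch of serial-number conflict detection: a store
    must name the revision it thinks is current."""

    def __init__(self):
        self._current = {}  # oid -> (serial, data)

    def load(self, oid):
        return self._current[oid]

    def store(self, oid, prev_serial, data, tid):
        serial, _ = self._current.get(oid, (None, None))
        if serial is not None and serial != prev_serial:
            # Another client committed a newer revision first.
            raise ConflictError(oid)
        # In the common case, the object's new serial number is just
        # the committing transaction's id.
        self._current[oid] = (tid, data)
        return tid
```

A second client that read the object at tid1 and tries to commit after someone else committed tid2 passes the stale prev_serial and gets the conflict.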
In several places for MVCC we need to know the starting and ending transaction ids for a revision. In most cases, this starting transaction is the same as the serial number. I wish there were some way to avoid the need for a separate serial number -- like a flag that was set only for abort version records. The server would have to do extra work to figure out whether a conflict really exists for an abort version, then. If we allowed abort version to generate spurious conflicts, we could simplify all the APIs.
In the details of the FileStorage implementation, serial numbers are tricky to manage because of backpointers. A data record contains a backpointer for undo and version operations, because they don't create a new copy of the data; they just point to an earlier record that does have the data. As a result, the implementation is really dealing with two data records, and you have to decide whether to use the serial number from the current record or the old record. FileStorage always uses the current serial number, which avoids a bunch of seeks+reads to resolve backpointers at store time. As a result, you have to copy the old serial number into the new record for abort version. Would it be possible to always pass the old serial number and then do a little more work for store calls?
The new loadNonCurrent call is returning the old serial number rather than the new one. That seems like the only sensible thing for Berkeley storages to do, but it's inconsistent with plain-old load.
Simple Definition of Web Services
The draft of Werner Vogels's article "Web Services are not distributed objects" has a good, simple definition of web services and explains how they aren't RPC or distributed objects. Too bad the final draft is not freely available.
Serial Numbers Resolved
After my musings last night, we decided to make serial numbers exactly equal to transaction ids in all cases. The apparent optimization for the rare abort version is eliminated, but other logic is simplified a lot.
ZEO Cache Strategies
This note summarizes some work on the design of a new cache for ZEO clients. The two key issues appear to be the cache replacement policy and managing persistent storage of the cache. A replacement algorithm like 2Q [Johnson] seems to match our data. Not sure about persistent storage yet. Tim Peters did a lot of the leg work tracking down relevant papers and summarizing them; this is just a recap.
The ZEO cache plays a crucial role in scaling large Zope clusters. It is much faster to read data from the ZEO cache than from the server, and the server can be saturated with writes when the cluster is busy. So we want the cache to absorb a lot of client reads.
Two areas of research relevant to this problem are buffer caches and Web proxy caches.
It would make sense to use memory for this cache, because machines have so much main memory. For Zope, it doesn't work. A Zope app server uses a lot of memory -- several hundred MBs is not unusual. We don't have memory to spare for the second-level object cache.
In ZODB a top-level object cache exists for each thread, because they each have independent views of the database. Even unmodified objects need to be shared in our system, because we're not using any virtual memory tricks. These caches count number of objects rather than size of objects, so a few large objects can use a lot of memory. It's almost certain that there are a bunch of leaks in Zope, too.
A strict LRU algorithm probably isn't the best choice.
The ZEO cache is a second-level cache, which means that access to frequently used objects should hit mostly in the upper-level cache. In the cache trace data I've looked at so far, about 25% of the objects are referenced only once. More than half of the references are for 17% of the objects.
The objects are of varying size -- from a few tens of bytes to many MBs. Most of the objects are small: For one customer database, 50% of the objects are 384 bytes or smaller and 89% are 2K or smaller.
One complication is that other clients are modifying objects: cached objects are being replaced, and other current objects are being removed to preserve consistency. Most of the cache replacement literature seems to focus on uniprocessor main memory, where modified objects are written through to the cache. Our problem is closer to multiprocessor cache coherence with a write-update protocol, but I didn't find any papers on this subject.
The multi-queue algorithm [Zhou] is for second-level buffer caches on servers. The difference between client and server is that most of the frequent accesses are absorbed by caches on the client, so the temporal distance between accesses on the server is fairly large -- larger than what we're seeing with a ZEO client cache. A later paper [Chen] suggests writing objects to the second-level cache when they are evicted from the first-level cache instead of when they are loaded into it.
The simpler 2Q algorithm [Johnson] looks promising, because it will handle the many objects that are referenced only once. A later paper [Lee] points out some problems with 2Q, but I haven't read it yet.
I'd also like to look at caching strategies that take size into account. The greedy dual size algorithm [Cao] for web proxy caches is the best known example. I'm also going to read the paper about the LRU-SP algorithm [Cheng].
A database is slow, and so is storing each object in its own file. Instead, you want to use a single file -- or a small number of files -- and manage storage within it yourself. If performance is a goal and portability isn't, you could also use a raw disk. Iyengar discusses the issue at length.
Tim has some ideas about how to implement this efficiently, but we're waiting for decent trace data to drive some simulations. The basic idea of log-structured file systems [Rosenblum] is relevant, because it minimizes the cost of writing new data to the cache. We don't need complicated cleaning heuristics, because we can always toss data and get it from the server later.
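A minimal sketch of that idea, assuming invented names and a made-up record layout (this is not the real ZEO cache format): every store is a sequential append, an in-memory index maps oids to file offsets, and "cleaning" is just starting over, since any object can be re-fetched from the server.

```python
import struct

class AppendOnlyCacheFile:
    """Log-structured cache file sketch: cheap sequential appends,
    in-memory index, no cleaner."""

    HEADER = struct.Struct(">8sI")   # 8-byte oid, 4-byte data length

    def __init__(self, path, maxbytes):
        self.f = open(path, "w+b")
        self.maxbytes = maxbytes
        self.pos = 0                 # next append offset
        self.index = {}              # oid -> (offset, length)

    def store(self, oid, data):
        need = self.HEADER.size + len(data)
        if self.pos + need > self.maxbytes:
            # No cleaning heuristics: toss everything and start over;
            # dropped objects come back from the ZEO server on demand.
            self.pos = 0
            self.index.clear()
        self.f.seek(self.pos)
        self.f.write(self.HEADER.pack(oid, len(data)))
        self.f.write(data)
        self.index[oid] = (self.pos + self.HEADER.size, len(data))
        self.pos += need

    def load(self, oid):
        if oid not in self.index:
            return None              # miss: caller asks the server
        offset, length = self.index[oid]
        self.f.seek(offset)
        return self.f.read(length)

    def invalidate(self, oid):
        # Another client committed a new revision; forget our copy.
        self.index.pop(oid, None)
```

The append-only discipline is what makes writes cheap: the disk head never seeks during a burst of stores.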
Pei Cao and Sandy Irani. Cost-Aware WWW Proxy Caching Algorithms. Proceedings of the Usenix Symposium on Internet Technologies and Systems (USITS), 1997.
Zhifeng Chen, Yuanyuan Zhou, and Kai Li. Eviction Based Placement for Storage Caches. Proceedings of the 2003 Usenix Technical Conference.
Kai Cheng and Yahiko Kambayashi. LRU-SP: A Size-Adjusted and Popularity-Aware LRU Replacement Algorithm for Web Caching. Proceedings of the 24th IEEE Computer Software and Applications Conference, 2000.
Arun Iyengar, Shudong Jin, and Jim Challenger. Techniques for efficiently allocating persistent storage. Journal of Systems and Software, Vol. 68 (2003): 85--102.
Theodore Johnson and Dennis Shasha. 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. Proceedings of the 20th Conference on Very Large Databases (VLDB), 1994.
Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H. Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. On the Existence of a Spectrum of Policies that Subsumes the Least Recently Used (LRU) and Least Frequently Used (LFU) Policies. Proceedings of ACM SIGMETRICS, 1999.
Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems, Vol. 10, No. 1 (1992): 26--52.
Yuanyuan Zhou, James F. Philbin, and Kai Li. The Multi-Queue Replacement Algorithm for Second Level Buffer Caches. Proceedings of the 2001 Usenix Technical Conference.
Ironic that a few weeks after saying I've never read anything from an Elsevier journal, I found an article published in an Elsevier journal. At least it was available online, too.
Simulating a 2Q Object Cache
I used three simulators to look at a 36-hour trace from zope.org this afternoon. I compared a modified 2Q algorithm against the existing algorithms. It did better; the hit rate for a 100MB cache was almost equal to the hit rate for a cache of unbounded size. Not sure if 2Q can be implemented effectively, though.
I'm using the simul.py from the Zope 2.6 branch, because zope.org is running 2.6; the cache format is different for 2.7 and up. Jens gave me a cache trace containing 36 hours of zope.org data.
The results of the three simulations appear below.
I checked in the 2Q implementation. The original 2Q algorithm was simple because it dealt with fixed size blocks. I had to make a bunch of small changes to deal with variable sized blocks. There are also odd corners: When you store data that is currently in the cache, I allow the cache to exceed its size limits until the next cache miss.
None of the simulations deal with writing to real files. It's easy to handle the A1in cache -- things read recently but not promoted to the top-level cache -- because it's just a FIFO. I assume A1out -- the list of oids that fell off the end of A1in -- can be written out when the cache shuts down.
It's the top-level LRU cache that's tricky. I don't know if there's a good way to implement an LRU cache on top of a file. We'll probably need to simulate one of Tim's rotating-file-pointer schemes and see how much it hurts.
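The bookkeeping described above -- A1in, A1out, and the top-level LRU cache (Am in the 2Q paper) -- can be sketched in a few lines if you ignore the file layout entirely. This is an illustrative in-memory model with invented parameter names, not the checked-in simulator code; byte budgets replace 2Q's fixed block counts.

```python
from collections import OrderedDict

class TwoQCache:
    """2Q [Johnson] adapted to variable-sized objects."""

    def __init__(self, maxbytes, in_frac=0.25, out_entries=10000):
        self.maxbytes = maxbytes
        self.kin = int(maxbytes * in_frac)   # byte budget for A1in
        self.kout = out_entries              # A1out holds oids only
        self.a1in = OrderedDict()            # oid -> data (FIFO)
        self.a1out = OrderedDict()           # oid -> None (ghost FIFO)
        self.am = OrderedDict()              # oid -> data (LRU)
        self.in_bytes = self.am_bytes = 0

    def load(self, oid):
        if oid in self.am:
            self.am.move_to_end(oid)   # LRU touch
            return self.am[oid]
        if oid in self.a1in:           # 2Q does not reorder A1in
            return self.a1in[oid]
        return None                    # miss (oid may be in A1out)

    def store(self, oid, data):
        if oid in self.a1out:          # seen recently: promote to Am
            del self.a1out[oid]
            self.am[oid] = data
            self.am_bytes += len(data)
        else:                          # first sighting goes to A1in
            self.a1in[oid] = data
            self.in_bytes += len(data)
        self._reclaim()

    def _reclaim(self):
        while self.in_bytes + self.am_bytes > self.maxbytes:
            if self.in_bytes > self.kin and self.a1in:
                oid, data = self.a1in.popitem(last=False)
                self.in_bytes -= len(data)
                self.a1out[oid] = None          # remember the oid only
                if len(self.a1out) > self.kout:
                    self.a1out.popitem(last=False)
            elif self.am:
                _, data = self.am.popitem(last=False)
                self.am_bytes -= len(data)
            elif self.a1in:
                oid, data = self.a1in.popitem(last=False)
                self.in_bytes -= len(data)
```

Note the one-shot objects never pollute Am: they enter A1in, age out into A1out, and only get promoted if they are stored again while their oid is still remembered.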
I simulated with a 100MB cache. In a 500MB cache everything fits without an eviction or a flip, which might make the details of the replacement algorithm moot.
Each line in the simulation below represents a snapshot of data between client restarts. I don't know why the client restarts so often. There are a few bursts of heavy write activity in the traces -- 10:21 and 20:46 on Nov 25. The 2Q cache has a much higher hit rate for these segments -- 58% and 54% vs. 37% and 29%. I expect it's doing a better job of keeping hot objects in the cache, which it can do because it's got an idealized LRU component.
START TIME    DURATION    LOADS     HITS  INVALS  WRITES  FLIPS  HITRATE
Nov 25 00:36   2:19:55    51340    18759      68     116      2    36.5%
Nov 25 02:56   1:55:01    54952    18481      33     442      3    33.6%
Nov 25 04:51   1:30:01    72530    28127      55      97      3    38.8%
Nov 25 06:21   1:00:13    46689    16871      19     111      2    36.1%
Nov 25 07:21   1:45:09    53628    20045      64     143      3    37.4%
Nov 25 09:06   1:15:02    58238    20881      28     163      3    35.9%
Nov 25 10:21   2:05:01   126466    48034     604    4751      7    38.0%
Nov 25 12:26   1:15:03    51919    18122      29      68      2    34.9%
Nov 25 13:41   2:45:03    69965    22808     130     296      4    32.6%
Nov 25 16:26   4:19:34    37131    12712       3      38      2    34.2%
Nov 25 20:46   6:05:06   110716    34904     174    2605      7    31.5%
Nov 26 02:51   2:04:57    79765    27891     507    1157      5    35.0%
Nov 26 04:56     40:01    53910    19181      93     187      3    35.6%
Nov 26 05:36   1:15:11    51677    17314      12     456      3    33.5%
Nov 26 06:51   1:00:08    51355    18301      27      92      2    35.6%
Nov 26 07:51   1:04:56    51923    18271       5      68      2    35.2%
Nov 26 08:56   1:05:13    66551    24047      12      90      3    36.1%
Nov 26 10:01     50:04    48137    17930      16      80      2    37.2%
Nov 26 10:51     28:00    40506    16626      13      70      1    41.0%
Nov 25 00:36  34:43:38  1177398   419305    1892   11030     59    35.6%  OVERALL
START TIME    DURATION    LOADS     HITS  INVALS  WRITES  FLIPS  HITRATE
Nov 25 00:36   2:19:55    51340    16862      52     116      3    32.8%
Nov 25 02:56   1:55:01    54952    16433      34     442      3    29.9%
Nov 25 04:51   1:30:01    72530    26759      51      97      5    36.9%
Nov 25 06:21   1:00:13    46689    16458      17     111      3    35.3%
Nov 25 07:21   1:45:09    53628    18971      53     143      3    35.4%
Nov 25 09:06   1:15:02    58238    20556      28     163      3    35.3%
Nov 25 10:21   2:05:01   126466    46146     552    4751      9    36.5%
Nov 25 12:26   1:15:03    51919    17269      26      68      3    33.3%
Nov 25 13:41   2:45:03    69965    21461     138     296      5    30.7%
Nov 25 16:26   4:19:34    37131    12409       3      38      2    33.4%
Nov 25 20:46   6:05:06   110716    31716     171    2605      9    28.6%
Nov 26 02:51   2:04:57    79765    27015     478    1157      6    33.9%
Nov 26 04:56     40:01    53910    18012      86     187      3    33.4%
Nov 26 05:36   1:15:11    51677    16902      12     456      3    32.7%
Nov 26 06:51   1:00:08    51355    16659      27      92      3    32.4%
Nov 26 07:51   1:04:56    51923    17963       5      68      3    34.6%
Nov 26 08:56   1:05:13    66551    24446      14      90      4    36.7%
Nov 26 10:01     50:04    48137    17694      18      80      3    36.8%
Nov 26 10:51     28:00    40506    16581      13      70      2    40.9%
Nov 25 00:36  34:43:38  1177398   400312    1778   11030     75    34.0%  OVERALL
START TIME    DURATION    LOADS     HITS  INVALS  WRITES  EVICTS  HITRATE
Nov 25 00:36   2:19:55    51340    17629     142     116   23103    34.3%
Nov 25 02:56   1:55:01    54952    20390      53     442   21426    37.1%
Nov 25 04:51   1:30:01    72530    34056      71      97   11500    47.0%
Nov 25 06:21   1:00:13    46689    18223      23     111   22910    39.0%
Nov 25 07:21   1:45:09    53628    21960      91     143   10709    40.9%
Nov 25 09:06   1:15:02    58238    23473      39     163   22095    40.3%
Nov 25 10:21   2:05:01   126466    73543     822    4751   32547    58.2%
Nov 25 12:26   1:15:03    51919    19069      51      68   14365    36.7%
Nov 25 13:41   2:45:03    69965    31514     298     296   32707    45.0%
Nov 25 16:26   4:19:34    37131    12783       3      38   23743    34.4%
Nov 25 20:46   6:05:06   110716    59497     286    2605   32389    53.7%
Nov 26 02:51   2:04:57    79765    37405     671    1157   30926    46.9%
Nov 26 04:56     40:01    53910    21659     120     187   24326    40.2%
Nov 26 05:36   1:15:11    51677    20213      20     456   12664    39.1%
Nov 26 06:51   1:00:08    51355    19113      29      92   18965    37.2%
Nov 26 07:51   1:04:56    51923    19772      17      68    9988    38.1%
Nov 26 08:56   1:05:13    66551    28465      15      90   18300    42.8%
Nov 26 10:01     50:04    48137    19193      20      80   15014    39.9%
Nov 26 10:51     28:00    40506    16625      12      70       0    41.0%
Nov 25 00:36  34:43:38  1177398   514582    2783   11030  377677    43.7%  OVERALL
Saw a link to a brief history of the Strongtalk system, a Smalltalk implementation with static types and a runtime system based on type feedback. The company was eventually bought by Sun to work on HotSpot.
Several people had tried to build type systems for Smalltalk (Borning, Palsberg & Schwartzbach, Graver & Johnson), but it was clearly an enormously difficult task, because of the vastly more flexible nature of the way Smalltalk is used compared to any existing statically-typed language, not to mention the unprecedented problem of having to retrofit a type system onto existing untyped code.
I would think the same kind of project could be done for Python. It's a similarly dynamic system and a JIT based on type-feedback would probably be as effective. So what kind of effort would be involved? It looks like they had six talented people with previous experience working for almost two years to get the first version done.
ARC Policy for Cache Replacement
Taj Khattra passed along a reference to a new cache replacement algorithm from IBM that was presented at the FAST conference this spring. The algorithm is conceptually simple, has low overhead, and the paper shows good results for a wide variety of workloads. I'll run it through our simulator on Monday.
The basic idea is to keep two LRU lists, one of pages seen once recently and one of pages seen twice recently. The cache keeps about twice as many objects in the lists as it can hold in memory. It keeps a variable number of objects from each list in memory, varying that number to track the current workload.
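The two-list description above maps onto ARC's T1/T2 (resident pages seen once/twice recently) and B1/B2 (the same number of recently evicted page ids), with an adaptive target p splitting memory between them. Here's a minimal sketch for unit-size pages following the structure of the published algorithm; it's illustrative, not the paper's exact pseudocode, and the names are mine.

```python
from collections import OrderedDict

class ARCCache:
    """Sketch of ARC [Megiddo & Modha] for unit-size pages."""

    def __init__(self, c):
        self.c = c                 # cache capacity in pages
        self.p = 0                 # adaptive target size of T1
        self.t1, self.t2 = OrderedDict(), OrderedDict()  # resident
        self.b1, self.b2 = OrderedDict(), OrderedDict()  # ghosts

    def access(self, x):
        """Return True on a hit, False on a miss."""
        if x in self.t1:           # hit on a once-seen page: promote
            del self.t1[x]
            self.t2[x] = True
            return True
        if x in self.t2:           # hit on a hot page: LRU touch
            self.t2.move_to_end(x)
            return True
        if x in self.b1:           # ghost hit: recency was undervalued
            self.p = min(self.c, self.p +
                         max(len(self.b2) // max(len(self.b1), 1), 1))
            self._replace(x)
            del self.b1[x]
            self.t2[x] = True
            return False
        if x in self.b2:           # ghost hit: frequency was undervalued
            self.p = max(0, self.p -
                         max(len(self.b1) // max(len(self.b2), 1), 1))
            self._replace(x)
            del self.b2[x]
            self.t2[x] = True
            return False
        # complete miss: make room, tracking at most 2c page ids total
        if len(self.t1) + len(self.b1) == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(x)
            else:
                self.t1.popitem(last=False)
        elif (len(self.t1) + len(self.t2) +
              len(self.b1) + len(self.b2)) >= self.c:
            if (len(self.t1) + len(self.t2) +
                    len(self.b1) + len(self.b2)) == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(x)
        self.t1[x] = True
        return False

    def _replace(self, x):
        # Evict from T1 or T2 depending on the adaptive target p.
        if self.t1 and (len(self.t1) > self.p or
                        (x in self.b2 and len(self.t1) == self.p)):
            lru, _ = self.t1.popitem(last=False)
            self.b1[lru] = True    # keep the id, drop the data
        elif self.t2:
            lru, _ = self.t2.popitem(last=False)
            self.b2[lru] = True
```

The appeal for a ZEO cache is that p tunes itself: scan-heavy periods grow the once-seen side, while stable working sets grow the twice-seen side.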
I looked at the greedy dual-size paper, but it's not obvious that it is a good fit. The implementation requires a priority queue, so replacement is O(log k), where k is the number of objects in the cache. The algorithm considers both the size of an object and the cost to fetch it, whereas we assume a uniform cost.
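For reference, here's a sketch of greedy dual-size with a heap and lazy deletion; with cost fixed at 1 this is the GDS(1) variant. The class and parameter names are invented, and a real cache would also need invalidation.

```python
import heapq

class GreedyDualSize:
    """Greedy Dual-Size [Cao & Irani]: each object gets a value
    H = L + cost/size; evict the minimum H and inflate L to it."""

    def __init__(self, maxbytes):
        self.maxbytes = maxbytes
        self.L = 0.0                 # inflation value
        self.heap = []               # (H, seq, oid), lazily deleted
        self.entries = {}            # oid -> (data, H, seq)
        self.used = 0
        self.seq = 0                 # distinguishes stale heap records

    def _push(self, oid, data, cost):
        H = self.L + cost / len(data)
        self.seq += 1
        self.entries[oid] = (data, H, self.seq)
        heapq.heappush(self.heap, (H, self.seq, oid))

    def store(self, oid, data, cost=1.0):
        if oid in self.entries:
            self.used -= len(self.entries.pop(oid)[0])
        while self.used + len(data) > self.maxbytes and self.entries:
            H, seq, victim = heapq.heappop(self.heap)
            entry = self.entries.get(victim)
            if entry is None or entry[2] != seq:
                continue             # stale record: skip it
            self.L = H               # inflate the baseline value
            self.used -= len(entry[0])
            del self.entries[victim]
        if len(data) <= self.maxbytes:
            self._push(oid, data, cost)
            self.used += len(data)

    def load(self, oid, cost=1.0):
        entry = self.entries.get(oid)
        if entry is None:
            return None
        data, _, _ = entry
        self._push(oid, data, cost)  # a hit resets H to L + cost/size
        return data
```

With uniform cost, H = L + 1/size, so big objects are evicted first unless hits keep refreshing them -- which is exactly the size bias the paper is after.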
Irani has a more theoretical paper on caching with multi-size pages. It's fairly dense, but has a couple of important ideas early on. First, I haven't thought about what the cost model should be. We've been using hit rate, but we might consider bit rate -- counting the bits that don't have to be fetched rather than the objects. Second, computing an optimal replacement policy for the bit rate cost model with multi-size objects is NP-hard (according to Fiat); it's unknown whether the hit rate model is in P.
Pei Cao and Sandy Irani. Cost-Aware WWW Proxy Caching Algorithms. Proceedings of the Usenix Symposium on Internet Technologies and Systems (USITS), 1997.
Sandy Irani. Page Replacement with Multi-Size Pages and Applications to Web Caching. Algorithmica, Vol. 33, No. 3 (July 2002): 384--409. A preliminary version appeared in Proceedings of the 29th Symposium on the Theory of Computing, 1997.
Nimrod Megiddo and Dharmendra S. Modha. ARC: A Self-Tuning, Low Overhead Replacement Cache. Proceedings of the USENIX Conference on File and Storage Technologies (FAST 03), 2003.