Jeremy Hylton :
||last modified Thu Mar 17 01:11:16 2005
Ethan Berkove gave a fun talk titled "Mathematics and Origami" today. He covered several interesting mathematical problems related to folding. He also brought some impressively detailed pieces, including an eagle and a leopard.
The first problem was folding a paper into thirds, which is equivalent to trisecting an angle or duplicating the cube. The latter two are classic problems from ancient Greece; it is impossible to solve with a straight edge and compass. It is possible via folding.
The key idea is to fold two points to two lines. This provides a method to trisect an angle. Tom Hull has some good notes on this kind of geometric construction. Berkove mentioned Hull as one of the leading researchers in the field.
The next problem was flat folding. Given a pattern of creases on a flat sheet of paper, can the paper be folded flat? There are two kinds of creases, called mountain and valley creases, depending on whether the paper runs up or down from the crease. The Kawasaki condition says that the sum of alternating angles around a single vertex is always pi in a flat fold.
Not all crease patterns will fold flat. The example he showed involved a triangle. The three vertexes involved satisfying the Kawasaki condition, but aren't compatible with each other. In general, deciding whether a set of creates will fold flat is NP-hard. Again, Hull has a good page on flat folds.
The third problem had to do with folds and cuts. He started with this simple problem: Can you fold a square piece of paper so that you can cut a smaller square out of the middle of it with a single, straight cut? Yes. Fold on the diagonals and cut the end off the resulting triangle.
Berkove mentioned on solution to the fold-and-cut problem, due to Erik Demaine, Martin Demaine, and Anna Lubiw. Any planar graph can be cut out by making some set of folds and a straight cut.
The last problem had to do with coloring. He showed several nice pieces that he had made. One was a soccer ball frame made with red, white, and blue paper. I didn't take good notes here, but he related it to the Four Color Theorem. See Hull's color notes, too.
I wish I had asked a question about the design techniques for the very intricate origami pieces he showed. He mentioned that a chief challenge was extracting corners; the eagle had six corners (feet, wings, beak, and one other). The challenge is to "pull corners out of a flat piece of paper."
What real errors does pychecker catch?
PyChecker is an invaluable tool, but its warns about too many things that are only sometimes a problem. I have found many bugs in Zope code using pychecker, but it takes extra effort to filter out all the false positives.
We have integrated PyChecker with the Zope test runner. If you run the tests with the -c argument, PyChecker is loaded and reports warnings, too.
I ran the ZODB3 test suite today with PyChecker enabled. It found several problems. Only one of the problems was very serious, but even the minor ones makes the code harder to read. I decide to keep track of which warnings were most helpful and which were most often spurious.
The three most helpful warnings were:
Unused local variables is a good warning, because they almost always mean there is code you can delete! Zope has a lot of local variables created for dubious optimization purposes. One side-effect is that useless local variables live on as the code changes.
Two warnings were all false positives so far as I could tell. There were a lot of warnings, and I only looked at a few before giving up.
No class attribute found. This happens a lot when you program with mixin classes and the mixin calls methods defined in the concrete class. The class itself doesn't define the method, but the classes that are actually instantiated do.
Parameter not used. If you multiple implementations of an interface, it is not always the case that all the arguments are used. As one example, there is a ZEO class that implements a getattr hook that always raises an exception. pychecker complains that the attribute argument is not used. This implementation of __getattr__() does not need the argument, but other implementations probably do. There are many examples in ZODB where there is a minimal implementation on an interface that ignore some arguments.
R.W. Apple, Jr.
There's a terrific profile of R.W. (Johnny) Apple, Jr. by Calvin Trillan in last week's New Yorker. Apple is an associate editor of the New York Times and a character. He is a larger than life character -- a great reporter and an expert on food, wine, architecture, and everything else -- and the profile is full of entertaining anecdotes.
Apple has been writing about food lately. I long dispatches from Philadelphia and Baltimore earlier this year. The Baltimore piece merits a brief mention in Trillian's profile; it confirms Apple's skills as a reporter and his judgment on food. Trillan writes:
I was present when he found what he described in the Times as his "nominee for the single best crab dish in Baltimore, if not the Western Hemisphere"--the jumbo limp crab cake at Faidley Seafood, in the Lexington Market--and I can testify that his look of triumph did give him some resemblance to a very big four-year-old.
I can confirm that there is no better crab cake in Baltimore than Faidley's. I lived in Baltimore for seven years and never found a better one.
Tracing the Python Stack in GDB
When Python is hung or otherwise unresponsive, one option is to attach to the process with gdb and inspect if the C stack to find out what the interpreter is doing. It is straightforward if you are familiar with the Python internals; this note is intended to describe the basics for people who aren't.
info threads will show you each of the threads in the
(gdb) info threads 2 Thread 1147100464 (LWP 9881) 0xffffe002 in ?? () 1 Thread 1074906720 (LWP 8497) 0xffffe002 in ?? ()
where will display the C stack from the current frame
down let you up and down on the
stack. Look at the names of the C functions. Many of the lowest
functions will have obvious names, like time_sleep or select_select;
they'll tell you what system call the process is currently executing.
#0 0xffffe002 in ?? () #1 0x400236ca in time_sleep (self=0x0, args=0x41c7e734) at /home/jeremy/src/python/Modules/timemodule.c:208 #2 0x08108126 in PyCFunction_Call (func=0x403c237c, arg=0x41c7e734, kw=0x0) at ../Objects/methodobject.c:73 #3 0x080c07e4 in call_function (pp_stack=0xbfffc454, oparg=1) at ../Python/ceval.c:3439 #4 0x080bd0c9 in eval_frame (f=0x9397c5c) at ../Python/ceval.c:2116 #5 0x080bed95 in PyEval_EvalCodeEx (co=0x404f36b0, globals=0x403e7b74, locals=0x0, args=0x922fe54, argcount=2, kws=0x922fe5c, kwcount=0, defs=0x403f8208, defcount=1, closure=0x0) at ../Python/ceval.c:2663 #6 0x080c0bd9 in fast_function (func=0x403f9264, pp_stack=0xbfffc674, n=2, na=2, nk=0) at ../Python/ceval.c:3529 #7 0x080c0919 in call_function (pp_stack=0xbfffc674, oparg=1) at ../Python/ceval.c:3458 #8 0x080bd0c9 in eval_frame (f=0x922fcf4) at ../Python/ceval.c:2116
The eval_frame() function is important for inspecting the Python runtime. Each Python function call will produce a new eval_frame() call on the C stack. The argument f to eval_frame() is a PyFrameObject. The frame contains the current line number and a pointer to the PyCodeObject, which has the name of the function and the name of the file that contains it.
It is straightforward to print the line number, but extracting the file and filename strings is a little trickier. The PyCodeObject has PyObject * pointers, so you must cast them to PyStringObject * to extract the string contents.
(gdb) up 4 #4 0x080bd0c9 in eval_frame (f=0x9397c5c) at ../Python/ceval.c:2116 2116 x = call_function(&stack_pointer, oparg); (gdb) p f->f_lineno $1 = 195 (gdb) x/s ((PyStringObject*)f->f_code->co_filename)->ob_sval 0x404f3674: "/usr/local/lib/python2.4/threading.py"
The last line of Python code executed by this thread is line 195 of threading.py from the standard library. In Python 2.3 and later, the line number will point to the start of the function rather than the current line. This is a limitation of removal of SET_LINENO; it's not trivial to tell exactly what line the code is executing. This isn't usually a problem unless you've got very long functions. It would be possible to figure out what line number is executing by using f->f_lasti and examing the code's line number table, but I've never needed to do that.
Of course, it doesn't have to be that tedious. Define the following function in your .gdbinit, and you can call pyframe any time you are sitting in eval_frame().
define pyframe x/s ((PyStringObject*)f->f_code->co_filename)->ob_sval x/s ((PyStringObject*)f->f_code->co_name)->ob_sval p f->f_lineno end
#8 0x080bd0c9 in eval_frame (f=0x922fcf4) at ../Python/ceval.c:2116 2116 x = call_function(&stack_pointer, oparg); (gdb) pyframe 0x41a4474c: "/home/jeremy/src/branches/Zope-2_7-branch/lib/python/ZEO/zrpc/client.py" 0x404b7c14: "connect" $2 = 165
The New York Times ran an editorial about the California election today. It's a depressing result, and I think the editorial captured the real problem aptly:
The exit polls did not shed much light on the California voters' feelings, except their profound sense of irritation. A sizeable chunk of the Schwarzenegger voters said they had voted on the issues, but agreed that he had not really addressed them.
I'd like to write more shorter entries for the weblog. Two events conspired to get me on that track today. I fixed my weblog generator to deal with multiple entries on the same date, and I read Coding Smart: People vs. Tools by Donn M. Seeley.
Fred Drake pointed out that my RSS feed had an item about pychecker that pointed to the Johnny Apple item. I hadn't tested the generator when I wrote two entries for that day, and it sort of looked like it worked. It had generated a decent looking front page by accident, but that was it. So I fixed the software to use individual letters as tags in addition to the date.
Seeley's article, in ACM Queue, has a list of good work practices. First on the list is 1. Keep full development notes and debugging logs:
I use lab notes extensively. Now that this practice is ingrained, I can't imagine how I survived before. My notes are all online so I can search them using pattern-matching tools, and because I type faster than I can write. This practice gave me the single biggest boost to my productivity, reducing the amount of time spent trying to remember what I did a year ago/a month ago/a week ago/yesterday. Always record your breakthroughs and dead ends so you won't have to reinvent them. Lab notes are an amazing productivity tool that many good programmers never use.
Closely related to lab notes is the debugging log. You can keep track of debugging notes in a variety of ways: Annotate your debugging experience by hand; cut and paste debugging material into a lab-notes file; run a line-oriented debugger inside a scripting environment; or do all three! With logs, you should never need to re-create a debugging situation just to see the same data, and you should be able to search the logs for particular names or values later.
I've never used lab notebooks extensively. I use tools that produce some long-term records, like a version control systems, and I try to keep a lot of my email. Over the last few years working with Tim Peters, I've also started putting important hints in comments in the code. Comments and revision history are useful in a narrow way, because they are tightly integrated with my programming tools. Email is less useful, because it's hard to manage and search email.
A web log can serve as a useful notebook. It will get indexed by Google, so there's a decent search mechanism. I already use Google to search other people's "debugging logs." If you get an obscure error message you don't understand, just search for it with Google. If you're lucky, you'll find mailing lists where people asked and answered your question; it's so unlikely you're the first person to ask. If you're unlucky, you'll get a checkin message with the code containing the error text.
A tangent: I've enjoyed reading ACM Queue. It's too bad I can't get it instead of Communications of the ACM with my ACM membership. CACM seems to be overtaken by management professors; I seldom read anything other than Beyond Risks.
Demaine is MacArthur Fellow
The 2003 MacArthur Fellows were announced the other day. One of the fellows is Erik Demaine, who was just mentioned in the Origami talk I went to last week.
Python and C++ Style
The Python standard library steers an even course between heavy object-oriented style and a lower level style that eschews objects. Bjarne Stroustrup advocates a similar style for C++ programs.
artima.com is running the first part of an interview with Stroustrup. The low-level style in C++ is much different than the low-level style in Python (and the OO-style is more involved, too).
Stroustrup defends structs, which are a perfectly good way to represent simple data. He advocates using a class once there are internal invariants that need to be maintained to use an object. In Python, you can use tuples or classes as a structs. I tend to prefer classes for data structures, because you can refer to the parts by name.
Going to the big class hierarchy is again, you write more code than you need to, and you get too much connection between different parts. I particularly dislike classes with a lot of get and set functions. That is often an indication that it shouldn't have been a class in the first place. It's just a data structure. And if it really is a data structure, make it a data structure.
In addition to the connections between classes, inheritance forces you to think about the contract between a base class and its subclasses. Joshua Block writes about this problem in Effective Java -- in particular the rule "Design for inheritance or else prohibit it." (It's a good book even if you don't program in Java.)
The C++ standard library provides some simple data structures like strings and vectors that improve on C++. Python's built-in datatypes are in the same boat.
The Python standard library varies widely in style. There are very old modules like nntplib that were written when Guido was style learning Python style. There are newer frameworks like logging that tend towards OO style, or threading that is modeled explicitly on Java's threads.
Patrick Logan posted a link to Bill Pugh's Skip List Cookbook. I think they're great data structures, too.
A skip list is a probabilistic alternative to balanced tree. In its most basic form, it is easy to implement. The list contains nodes with a value and varying numbers of forward links; the number of links determines the size. The nodes are in sorted order and each node of size n has links to the next nodes of size 1 through n.
Practical implementations of skip lists can get more complicated.
Josh MacDonald and
Ben Zhao did a
nice comparison of skiplists and b+-trees for a class
project. They implemented a deterministic variant of the skip
list (see J. Ian Munro, Thomas Papadakis, and Robert Sedgewick.
skip lists. In Proceedings of the third annual ACM-SIAM Symposium
on Discrete Algorithms (SODA), 1992) and found the performance
competitive and the implementation simpler.
algorithm for doing lookups in a peer-to-peer distributed hash
table has some high-level similarities to a skip list. The problem is
to locate a node in the network given a key. The nodes are arranged
in a ring, and each node maintains a finger table that points
to other nodes at various distances around the ring (increasing powers
of two). The finger table is a bit like the table of forward pointers
in a skip list, although each node in a Chord rind has the same number
of forward pointers.
James Aspnes and Gauri Shah presented a paper at SODA '03 on
skip graphs. I haven't read it yet, but it sounds interesting. A
skip graph extend the distributed hash table algorithm with operations
based on the ordering of the keys. Apparently a similar data
structure is used in SkipNet.
The Chord algorithm for doing lookups in a peer-to-peer distributed hash table has some high-level similarities to a skip list. The problem is to locate a node in the network given a key. The nodes are arranged in a ring, and each node maintains a finger table that points to other nodes at various distances around the ring (increasing powers of two). The finger table is a bit like the table of forward pointers in a skip list, although each node in a Chord rind has the same number of forward pointers.
James Aspnes and Gauri Shah presented a paper at SODA '03 on skip graphs. I haven't read it yet, but it sounds interesting. A skip graph extend the distributed hash table algorithm with operations based on the ordering of the keys. Apparently a similar data structure is used in SkipNet.
Running Unit Tests
Python programmers need better tools for writing, running, and debugging unit tests. Ian Bicking's rant about unittest makes me think that Zope's test.py driver ought to find more widespread use.
Zope software tends to have a lot of unit tests. Zope3 has 5,000 tests and ZODB has about 1,500. Many of these test aren't, strictly speaking, unit tests. ZODB has tests that spawn a server process and spend a few minutes committing transactions against it to try to provoke race conditions. Nonetheless, they are all written using the unittest module and the test.py driver.
Bicking notes that the problem is really the default unittest driver:
Probably the issue here is that unittest.main is stupid. You can't add options (e.g., your own verbosity hooks), you can't easily indicate what tests you want to do... it's just dumb and not extensible.
Zope's test.py driver has a ton of features. We often had to struggle with unittest to implement them, but they are there. The short list is:
Jim Fulton has done a nice job of integrating doctest with unittest, so that you can write doctest tests and have them integrated into a unittest framework.
Any test driver has got to make some assumptions about how to find
the tests to run and the code being tested. test.py has some fairly
rigid rules that might not make sense for every project. All tests
are integrated with the rest of the code in sub-packages. The test
driver loads tests by importing them. If you've got a package
foo, then test.py looks for tests in
foo.tests. It searches for files in a tests package that
begin with "test" and calls the test_suite() function it finds in
It is integrated with distutils for testing C extensions. It knows how to find code in the distutils build directory so that you can use setup.py to build an extension and test.py to test it.
The driver doesn't say anything about how you write the individual test suites, and Bicking has some good and funny complaints here. For example, "I do need setup and teardown (but not setUp and tearDown -- I'm sorry, but this is English, "setup" and "teardown" are two words, not four)."
His specific example is this TestCase class:
class DecodeTest(unittest.TestCase): def __init__(self, input, output): self.input = input self.output = output def test(self): self.assertEqual(decode(input), output)
This doesn't seem like the right way to start. Maybe I just don't see what it's supposed to do. If you create a lot of TestCase instances, then you don't have names for them and you can't get a simple summary of which tests passed and which failed.
Printing Unicode from Python
I have been struggling with Unicode for my weblog aggregator. There are several feeds that include Unicode data in the title or description. I tried to generate HTML output with a UTF-8 encoding, but that didn't seem to work.
I thought part of the problem was that I didn't know anything about Unicode. For better or worse, I learned that I knew almost everything that a working programmer should know on the subject. The only thing I learned Joel Spolsky's nice essay on encoding issues is what a code point is. (I had a vague notion, but I didn't really know.)
The aggregator generates an HTML page including the description elements from all of the source feeds. All of the Unicode I've run into could be encoded as Latin-1, but UTF-8 seemed more general. I generate the HTML with print statements. (I should probably use a template system like ZPT, but I'm not familiar with any of them and the HTML is pretty simple.)
You can add a meta tag to the HTML page to specify a content-type and charset; iso-8859-1 is Latin-1.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
I had been using "charset=utf-8" but that was not getting interpreted as I expected. I tried Mozilla and IE. It ends up displaying funny characters like an a with a hat, something I don't recognize, and a superscript TM. (I see the same thing in the current display for Python Programmer Weblogs.) When I changed to Latin-1, the characters started displaying properly -- umlauts and all.
The big pain is that you can't just write a line of code like
print '<a href="%(guid)s">%(title)s</a>' % dbecause the title might contain unprintable Unicode data. You need to specify an encoding by calling the encode() method of the Unicode string.
print ("<a href=\"%(guid)s\">%(title)s</a>" % d).encode("iso-8859-1")It's just clumsy to have to write all that. Isn't there some way to specify that stdout should use iso-8859-1 to encode all Unicode strings?
Screensaver Interferes with Testing
Why do tests run from my nightly cron jobs run so much more slowly than when I run them manually? I'll be it's the screensaver, which is now just configured to blank the screen.
I run several batches of ZODB and ZRS tests every night. The tests spawn ZEO server processes, interact with them, then clean them up. There are a bunch of ad hoc timeouts intended to prevent the tests from hanging indefinitely if something goes wrong. If the test runs very slowly, they'll timeout even though they are making progress. There are also multi-threaded tests that try to run until each thread makes a certain amount of progress -- again with a timeout to guard against deadlock.
The nightly cron jobs were much more likely to fail because of timeouts. We've spent several weeks trying to figure out why the tests fail so often and have boosted several timeouts as a result.
Tim remembered that he had a problem with his screensaver at Dragon. His tests ran three times slower overnight and it took him a once to figure out that it was the 3D pipes screensaver in Windows.
Why Did He Leave Pedro In?
So close... It was an exciting postseason with the prospects of either the Red Sox and the Cubs reaching the World Series. The Cubs ending was tragic, but the Red Sox just maddening. The bull pen pitched so well throughout the playoffs. When Pedro Martinez was obviously struggling, why did Little leave him in?
The small consolation was to see some consistently good pitching from former Oriole and fellow Pennsylvanian Mike Mussina.
Linked: Evocative or Explanatory?
I just finished Linked: The New Science of Networks, Albert László Barabási's book about the study of networks -- specifically scale-free networks and their use in describing the Internet, biological systems, social networks, and the economy. It was an interesting but rather unsatisfying book.
The writing was at times forced. The opening chapters discuss mathematical models of random networks. Each section begins with a vignette about Paul Erdos or some other bit of history. They had some interest, but served primarily as an interruption to the main narrative. The writing is best when it sticks to the science. There's also some interesting personal background about the author's own exploration of the field, but it's not as good as Alan Guth's book The Inflationary Universe.
My main gripe, however, was never being sure whether the mathematical model of a scale-free network -- i.e. one in which power law distributions exist and there is no typical node, no characteristic scale -- was intended to be just a mathematical convenience or whether it was really a representative model of the networks in question.
Walter Willinger and others have a critique of the Barabási-Albert model that suggests my uneasiness was well-founded. The paper Scaling phenomena in the Internet: Critically examining criticality (Proceedings of the National Academy of Science, Feb. 19, 2002):
bring[s] to bear a simple validation framework that aims at testing whether a proposed model is merely evocative, in that it can reproduce the the phenomenon of interest but does not necessarily capture and incorporate the true underlying cause, or indeed explanatory, in that it also captures the causal mechanisms (why and how, in addition to what).
The Barabási-Albert model is widely known. In addition to Linked, it has been described by papers in Science and Nature. It explains the evolution of a graph relying on two simple phenomena -- incremental growth and preferential attachment. In the resulting model, the number of links per node follows a power law distribution; we expect to see many nodes with few links and a few nodes with very many links.
Willinger et al. conclude that this model is evocative:
In particular, we examine a number of recently proposed dynamical models of Internet traffic and Internet topology at the AS level that explain the entirely unexpected scaling behavior in terms of critical phenomena. In the process, we offer conclusive evidence that even though the models produce the self-similar scaling phenomena of interest, they do not explain why or how the observed phenomena arise in the Internet. Some of these criticality-based explanations can still be put to good use.
I'll have to read the article more carefully.
This morning I had a chance to measure the amount of spam I get -- about 40% of incoming mail. I was away for the weekend, so a lot of email collected on the POP servers; that gave me that chance to count spam from a large batch of messages. I fetched 1618 messages; Spambayes marked 628 as spam and 71 as unsure.
The percentage of spam may be lowering during the week. It's hard to measure since I collect a little email at a time, so I don't know how much ham comes in. Since the test period was over a weekend, I didn't get much work-related email. On the other hand, python-dev was busy -- more than 100 messages.
I get python-list delivered by email. That means I get a lot of ham traffic that I don't read closely. If half of my inbound email comes from python-list, then I'm getting more like 85% spam.
The number of unsure messages was unusually high. I received about 15 uncaught bounce notifications from Mailman; I've never trained on Mailman bounces as ham or spam, so they always come up unsure. I also received a lot of duplicates of new spams. New spams sometimes show up as unsure; usually, you get several spams at a time.
Neat! Slithy is a presentation tool that supports animation and some interactivity. The presentations are Python programs.
Integrating unittest and logging
There ought to be some way to integrating the logging and unittest module. That is, a test could say that it expects a certain message to get logged or that it does not expect a message to get logged. I don't know if there is a way to write tests so that this wouldn't be unbearably tedious, but I suspect there is.
Jim and I have been talking about this feature for years, but have never had the time to pursue it. I was reminded of it recently because I was testing code that was designed to catch and log errors and continue. There was a real bug in the code that I spent hours trying to find; it was getting logged but its only effect on the test was to cause it to hang.
Large frameworks tend to have layers that catch unexpected exceptions and try to continue. It could be argued that this is a misfeature, but let's assume that it isn't. How do you test these systems? A legitimate bug in the code won't cause the framework to fall over, it will just keeping going. One possibility is to write a test that asserts that no errors are logged while the test is running.
We've made several abortive attempts to cause tests to fail when errors are logged. The problem is that many tests cause errors to be logged as part of normal operation. A ZEO server logs exceptions raised when executing remote method invocations. It is convenient to review these logs on a running server to look for signs of trouble. Many of the tests deliberately provoke exceptions, which are logged by the server. So those tests would need to indicate that they expect some error to be logged.
The question, then, is how to write a declaration of expected logging behavior. Should you write declarations for tests that do generate logging messages? Or can you should write declarations that say when you don't expect logging? The former seems more promising, because it would be hard to say what wasn't expected. In either case, however, you'd probably need to limit it to certain subsystems (assuming that the test involved several packages). Otherwise, a change to some other subsystem to add a log message could cause your tests to fail.
Chandler and ZODB
Early versions of Chandler used ZODB, but it was removed in favor of a repository system they developed themselves. Andi Vajda, the repository developer, posted a short explanation for that decision today:
Today, the Chandler repository is not really so much an object database as an item XML database combined with large collections of references directly stored in Berkeley DB....
The trade-off for these decisions and design choices is a somewhat steeper learning curve for programmers expecting a real object database like ZODB. My hope is that this trade-off is well worth the gains.
John Anderson, Chandler's architect, offered the following comment about ZODB:
In my discussions with various programmers concerning ZODB, it seems like they are reluctant to spend much time learning it and would rather go invent something new.
A key mismatch between ZODB's goal and the Chandler developer's goal was language independence. One of their developers explained to me that they might want to run client code on the repository and Python was too slow to run on a server. It's an odd comment given that ZODB's primary purpose is to provide persistence for an application server.
I haven't seen a document that describes the current Chandler repository or its programming model. It would be interesting to see how Chandler-specific requirements affected the protocol. I recall that an early design for the repository used an XML-based protocol that was like IMAP but with many more bells and whistles. It wasn't completely obvious how to integrate this protocol with ZODB. At the same time, it wasn't clear how many of these requirements were just YAGNI.
The key integration problem was granularity. ZODB works at the level of individual objects. When you request an object, you get the persistent object and any non-persistent objects it contains. (They might be better called second-class persistent objects; they do persist, but they can't be shared or referred to be name.) If you want to be able to load part of something, you need to organize that something as a collection of individual persistent objects. In Zope, large binaries are stored as a collection of smaller persistent objects, so that you can load part of the data without loading all of it. In that particular case, Zope is more complicated in order to get more fine-grained control over resource usage. In general, it seems good to have a single naming and sharing mechanism.
Another interesting requirement for Chandler was to be able to load objects in bulk from the server. It would be nice to have some kind of pre-fetching that would expose this to ZODB. I wrote about prefetching in April.
Lightweight Languages Program
The program is set for the Lightweight Languages Workshop. There are many programming languages represented, including Haskell, Lua, ML, Perl, Python, Ruby, and Scheme. The topics include hardware design, testing distributed systems, virtual machines, Lego MindStorms, and web programming.
One of the best things about this workshop series is that the video of the talks is archived on the web. You can still see last year's excellent keynote talks: Joe Amstrong on concurrency-oriented programming and Todd Proebsting on disruptive programming language technologies.
Learning About Weblog Services
A bunch of notes about services for reading weblogs. I want to learn more about these things.
Roland Tanglao's weblog has notes about the aggregators discussion at BloggerCon.
Technorati and Feedster were mentioned as having "contextualization and global view." Sounds like they are some kind of search engine. I need to find out what these services are. I visited the Technorati site once, but left without figuring out what it was for. The design was far from obvious.
It should be straightforward to hook an aggregator up to a search engine like ZCTextIndex. Bummer. It looks like no one ever packaged it for standalone use.
It should also be possible to have more structured queries. I took a quick look at XQuery and XPath, but the syntax turned me off. Erik Meijer has a better idea: provide XML support integrated directly with your favorite programming language.
If I wanted to read a lot of blogs, it seems obvious to hook the feed up to Spambayes so that I can focus more on the things I am interested in. I like reading Fredrik Lundh, but I don't expect I'll ever learn Swedish.
Another interesting idea is to use feeds to drive Plucker. If you've only got whatever will fit on a Palm or PDA, you need to be a little selective. (The little aggregator I'm running currently produces 476KB of HTML, and that's only a dozen or so feeds.)
It might be nice to have a second-order aggregator. The first-order aggregate is just the feeds that you are intentionally reading. The second-order aggregate is everything that they are reading. Either scrape their reading lists from the weblogs or get a narrower view based on what they actually link to in their weblogs.
Nearly forgot this one. Need to find out about Atom.
Helium Compiler for Haskell
Helium is a compiler for Haskell that is designed for students. Quality of the error messages has been the main concern...
A serious drawback of type inference is figuring out what went wrong when an error occurs. It looks like they've done a lot of work to provide better error messages.
ZODB 3.3 and Persistent Code
We had a meeting today to review the state of persistent code (modules, functions, and classes) in ZODB. Fred Drake is planning to port that code to ZODB3 for the 3.3 release.
(Yay! Fred has a weblog.)
I don't know when the release will be ready, but I'm optimistic that we'll at least get to beta this year.
The implementation for persistent code is fairly delicate. The basic idea is to load a module's source code and convert all the classes and functions into persistent objects. When you edit the code, the persistent module is modified in place to reflect the changes. The great thing about this feature is that changes to code can use the same transaction framework as changes to data. We treat code like data.
The 3.3 release sounds like it will be a big step forward. It will support new-style classes, like ZODB4, but will be backwards compatible with existing Zope and ZODB applications. It will also support multi-version concurrency control.
Fun Python Projects
Brett Cannon asked for ideas for a master's thesis related to Python. He got some interesting responses. Neil Schemenauer came up with most of the topics listed here, but his topics are dear to my heart. (Neil's suggestions are in italics.)
The key problem is that there's no good way to extend the syntax of the base language and still get it parsed. The compiler package doesn't have a parser. It looks like Polyglot realizes for Java what I had hoped to do with the compiler package.
Question: Is there a boundary between what can be accomplished by macros and what requires other mechanisms for extending the language? Could the Polyglot extensions be implemented in Maya?
I don't get it. I don't know the Zope community very well; mostly, I keep my head down and work on core ZODB issues. This whole thread about a 'hidden hand' seems hard to believe. Hadar's a straightforward guy; I don't see him or ZC having a hidden agenda.
Anyway, Dean Goodmanson says the Zope credits remain to be updated. It could be a good project for getting people in a better mood. Zope3's credits are extensive, but I know it took Jim days to write them. Python's acknowledgments are less detailed, but easier to maintain. It would be great to get at least an ACKS list for Zope.
Avoiding Contention in an Optimistic (Zope) Object Database
How do you implement high-performance shared objects using optimistic concurrency control? This note sketches some ideas about scalable data structures for ZODB.
The catalog is a major hotspot in a lot of Zope applications. Any time content is modified, the catalog re-indexes the content. The catalog's indexes cause several problems. They are a hotspot, because they are frequently updated, and they interfere with undo, because a single transaction can get content modification mixed up with index changes.
We can relax application-level consistency by batching catalog operations. The catalog will be a little out of date, but we can improve performance a lot in return. If we can collect batched operations into per-client buckets, we can avoid conflicts and scale to many clients.
ZODB supports conflict resolution, a scheme that works a bit like field calls. If a transaction cause a write conflict, it's resolver is executed at commit time. The resolver can try to compute a new state for the object that takes both changes into account. It can allow a transaction to succeed, instead of aborting and causing the client to retry.
A conflict resolution method is invoked with three copies of the object state: the state written by the current transaction, the state of the object when the transaction started, and the current state in the database. The first two states can be used to compute a logical diff; what change was the current transaction trying to make? Given the current state in the database, can the same change be made to it? If so, then conflict resolution can succeed. (Jim Fulton reminded me of how helpful it is to think about conflict resolution this way.)
A field call, described in Gray and Reuter's Transaction Processing book, is an old technique used in databases with locking. An update consists of a predicate and a transform. If the predicate is true when the transaction runs, then the transform is applied to the current state at commit time. It allows you to minimize the time you hold locks; you just need a read lock for the predicate, and the transform executes quickly. Conflict resolution is essentially the transform, but the predicate is the entire transaction up to that time.
In client-server ZODB, we chose to implement conflict resolution on the server. Normally, the server just writes database records to disk. Occasionally, it does garbage collection (pack), and needs to follow object references in that case. For conflict resolution, it's actually loading object state and running arbitrary application code. That's a scary prospect, but was justified on performance grounds. No measurements taken so far as I know, but it would take a long time if the server sent that data to the client and waited for it to return a new value. During that time, the database would be locked, preventing other transactions from committing.
There are three primary downsides to running conflict resolution on the server. The server is a scare resource, so it needs to be as efficient as possible. It can be hard to guarantee that client and server are running the same version of the software, which is usually necessary for conflict resolution to work. And, worst of all, bugs in the application's conflict resolution code can crash or hang the server.
One alternative, which I don't recall from the original design work, is to raise a conflict error that includes the three object states, abort the transaction, and have the client retry. The client would just run conflict resolution and retry the transaction with all the other state; it would not re-run the entire transaction. That would push the extra computation back to the client without the full cost of a retry. Since the server is often a bottleneck, it might be worth measuring.
Conflict resolution has been useful for specialized data structures like BTrees. In the case of the catalog, it may be better to design a different data structure that avoids conflicts all together or makes them unlikely during normal operation, including under heavy load.
The catalog queue allows catalog updates to be delayed. The idea is to put update events into a queue in the committing transaction. A different transaction periodically removes events from the queue and indexes the documents. The index process gets to amortize the cost of updating the index across multiple updates. If a document is updated several times in short time period, only one index operation is needed. It trades relaxed consistency between data and catalog for increased performance. The catalog queue objects use a complex conflict resolution scheme to allow multiple readers and writers.
The new idea is to implement the queue using a bunch of individual buckets and have each client write to a different bucket. The bucket would also have a flag to say whether it is being filled or drained. It starts out in the fill state. At some point, either because the bucket is too big or because too much time has elapsed, the bucket gets set to the drain state. Once in the drain state, the indexer starts processing the events in it.
This data structure would be implemented with no conflict resolution support. All the conflicts would cause the client to retry. One challenge is to prevent two different clients from filling the same bucket. You would still expect a conflict per client when a bucket changed from the fill to the drain state.
To avoid conflicts on buckets, you need to have more buckets than clients and a decent hash scheme for assigning clients to an arbitrary bucket -- something like Akamai's consistent hashing scheme. If a conflict does occur, the client getting the conflict needs to pick a different bucket. You could detect either read or write conflicts to make the decision.
If the number of clients is variable, the number of buckets might need to grow over time. I'm not sure how that would be accomplished. Perhaps a client writes its unique id in the bucket. When a new client comes along, it tries some number of times to find an un-owned bucket. If it can't find one, it creates a new bucket and adds it to the queue.
On goal of queued catalog is to collapse multiple operations on a single piece of content, so that you only re-index it once. It may be harder to achieve collapsing with this scheme, because different users will use different ZODB clients for each request. One client's operations could be spread across multiple buckets. Perhaps the indexer drains multiple buckets at the same time to address this. Or maybe there should be some kind of app server affinity in the cluster configuration.
One reason this scheme could work is that the catalog has fairly weak semantics. If you update content, it will be re-cataloged some time after the update, where the application and operators can tune the time versus desired performance.
There must be some literature on these sorts of problems, but I haven't found it yet. Here are two papers that seem close but not quite right:
Maurice Herlihy, 1990. Apologizing versus asking permission: optimistic concurrency control for abstract data types. ACM Transactions on Database Systems, 15, 1 (March), 96-124.
Steven D. Gribble, Eric A. Brewer, Joseph M. Hellerstein, and David Culler. Scalable, Distributed Data Structures for Internet Service Construction. Proceedings of the Fourth Symposium on Operating Systems Design and Implementation (OSDI 2000).