|Jeremy Hylton : weblog : 2004-01-16|
Friday, January 16, 2004
We fixed a serious bug in the FileStorage pack implementation this week. If you packed twice, first to time T1 and then to time T2 and T2 < T1, then you could lose data. The bug was introduced by my rewrite of pack for Zope 2.6.2. What went wrong?
The are a variety of answers. The bug itself is caused by the way FileStorage pack is implemented. It traverses the database to find all object revisions reachable as of the pack time. This traversal ignores any revision written after the pack time; they will be copied in a later phase. The unreachable revisions are then removed. A subsequent traversal using an earlier pack time is impossible. Older object revisions that provided connectivity among objects were removed by the first pack. This is basically the same reason you can't undo to a packed record. It isn't possible to provide a consistent snapshot of database state before the pack time.
Before I fixed the problem, I wrote a simple test case, involving three objects, that reproduced the data loss that was originally reported. It's important to understand the source of the bug completely before you try to fix it. If you don't understand the cause, the fix probably won't work either. And sometimes you'll find a couple more bugs while trying to work out what went wrong.
The previous implementation of pack had an explicit test for this case. Jim Fulton had discovered and addressed the problem in the early days of FileStorage. When I talked to him he said, "I fixed that bug." So it's a real subtlety of pack that had eluded him the first time around, too.
Tom Van Vleck wrote a nice article about fixing bugs, Three Questions about Each Bug You Find. The three questions are about addressing the failure -- both in the software and in the process. For this particular case, the third question is the most interesting.
1. Is this mistake somewhere else also?
With the test case in hand, I confirmed that the BerkeleyDB-based storages don't have this problem. They use a very different garbage collection scheme, so you can pack to an earlier time if you want; it won't have any effect, good or bad.
The GC traversal isn't performed in any other part of the storage. The undo implementation already handles previous packs. It checks for a transaction record with status 'p' (for pack) and stops if it finds one.
I don't think this issue comes up anywhere else.
2. What next bug is hidden behind this one?
There shouldn't be other bugs hiding behind this one. We were allowing insane packs to run. The fix is to raise an exception if you attempt to pack to a time earlier than a previous pack time. We're strictly reducing the number of possible program executions.
3. What should I do to prevent bugs like this?
The unit tests for pack probably don't cover a lot of edge cases. It's difficult to write a pack test, because you need to set up a network of objects with just the right set of references and revisions. I wrote several new tests before I started on the new pack implementation; they uncovered a few problems in the Berkeley storage pack.
We need to work harder on black-box testing. From the API alone, we could have asked: "What happens when you pack a second time with an earlier pack time?" On the other hand, it wouldn't be a bug even for FileStorage, unless you have the right history of object modifications and previous packs.
There aren't any good design documents about FileStorage, not even notes on the implementation. I think this is a design bug. If pack revolves around traversal of historical revisions, we should have worked that issue out. The new pack implementation is easier to read than the last one, but we didn't think to write down notes about the design issues. We should.
We need to manage bug reports better. I didn't find out about the problem until Toby Dickenson reported data loss on the zodb-dev list. In later discussion, Dieter Maurer mentioned that several problems had been reported on a Plone list. Eventually, we found a bug report and diagnosis sitting in the Zope Collector, from way back in November. We aren't doing a good enough job of connecting bug reports and developers. There may be many causes; e.g. users don't know where to send bug reports.
There was an article just the other day on Slashdot titled Rewrites considered harmful?. I didn't read the whole article, but it looked to be contemporary examples of the second system effect. It also had a link to an interesting article by Joel Splosky on the worst strategic mistake a software company can make rewrite the code from scratch.
Perhaps the rewrite of FileStorage was a mistake? Those two articles are really talking about software strategy and I'm talking about tactics. FileStorage is a small component of Zope and pack is just one operation, albeit the largest and most complex in FileStorage. The rewrite was a big improvement in a lot of ways, but I hadn't tracked down all the edge cases in the design.
One part of Splosky's article bugged me, though, and it's related to the redundant pack bug case. We had fixed this bug before, years earlier. Arguing against rewrites, Splosky says:
Each of these bugs took weeks of real-world usage before they were found. The programmer might have spent a couple of days reproducing the bug in the lab and fixing it. If it's like a lot of bugs, the fix might be one line of code, or it might even be a couple of characters, but a lot of work and time went into those two characters.
When you throw away code and start from scratch, you are throwing away all that knowledge. All those collected bug fixes. Years of programming work.
The problem is that all that knowledge and the hours or days spent tracking down a bug shouldn't be captured in just two characters of source code. If the only artifact of software engineering that you have is source code, you are in trouble. There ought to be a test case, comments, design and implementation notes, revision history. If someone learned a lot when they fixed the bug, they ought to make sure that knowledge gets capture effectively.