
Jeremy Hylton's Web Log, January 2004

compiler module needs improvement

Monday, January 05, 2004

A question about the compiler package on comp.lang.python over the weekend and some work I did last month reminded me that there are a lot of loose ends. The documentation should be better. It should be easier to compile an AST.

I should write down a long list of other things to do: Incorporate one of the pure-Python parsers so that there's no dependence on the C-Python core. Remove all the code designed to support optional nested scopes in Python 2.1. Make it easier to extend various components by subclassing, which requires documenting the pieces that are intended to be used this way.

ZODB 3.3a2 released

Tuesday, January 06, 2004

We finally got the ZODB 3.3a2 release finished. The new features should make a big difference for users, and I'm generally happy with the state of the codebase. I had originally hoped to do it on Christmas Eve, but many small-but-hard-to-diagnose bugs delayed it.

I said more in the release announcement.

The new features represent major changes to ZODB. Support for new-style classes means that ZODB programmers can use the new class features introduced in Python 2.2, like properties and descriptors. It also means that ZODB no longer uses ExtensionClass, which caused many subtle incompatibilities with other classes.

The multi-version concurrency control feature represents the end of read conflicts. If a transaction encounters a conflict, it reads older revisions of the object rather than raising a ReadConflictError. This feature allows read-only transactions to see a consistent view of the database without needing to handle read conflicts and restart the transaction.
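
For application code, the practical difference is the disappearance of the defensive retry loop around reads. A rough sketch of the old pattern (the root['index'] object and the report function are made up for illustration; get_transaction() was the builtin that ZODB installed at the time):

    # Pre-MVCC pattern (ZODB 3.2 and earlier): a concurrent commit could
    # make a plain read raise ReadConflictError, so read-only code had to
    # abort and retry the whole transaction.
    from ZODB.POSException import ReadConflictError

    def report(root):
        while 1:
            try:
                print len(root['index'])   # 'index' is a hypothetical persistent object
                return
            except ReadConflictError:
                get_transaction().abort()  # start over and hope for better luck

With MVCC in 3.3, the except clause should simply stop firing for read-only work; the connection serves up older revisions instead.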

The code is fairly stable, but we have noted a few hard-to-provoke bugs during development. One is a reference-counting bug in the persistence core that causes rare core dumps in Python's garbage collector. (Rare means once a week across multiple developers and testers.)

We hope to move to a beta release expeditiously, perhaps as early as next month. The final release schedule depends on the Zope 2.8 release schedule. More on that later.

Jim Fulton and Fred Drake are planning to work on persistent modules this month, first in Zope 3 / ZODB 4. I expect that we will get that code into the next ZODB 3.3 release, too. They have some notes in the Zope3 "Modules are Global" proposal.

I also noticed that Jim checked in some code for persistent weak references. A persistent weak reference is a reference to a persistent object that does not prevent the object from being packed. We had talked about this a few weeks ago, but I didn't realize it was actually going to be implemented. The chief difference from Python's weak references is that there is no callback function called when the object is collected. A callback doesn't make a lot of sense for ZODB, because it would have to be invoked during the pack operation; pack is a low-level mark-and-sweep operation that runs without any of the runtime necessary to execute user code.
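
For comparison, here is the callback behavior that ordinary Python weak references provide and persistent weak references omit. This is plain standard-library weakref, nothing ZODB-specific; the Node class is just an illustration:

    import weakref

    class Node(object):
        pass

    def gone(ref):
        # This notification is the part persistent weak references leave
        # out: pack has no way to run user code when it sweeps an object.
        print 'referent collected'

    n = Node()
    r = weakref.ref(n, gone)   # a weak reference with a death callback
    del n                      # last strong reference dropped; 'gone' runs
    print r()                  # the reference is now dead: prints None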

Zope Security Issues

Thursday, January 08, 2004, 9:35 p.m.

The Zope security work done last month was merged into the public CVS today. Lots of fixes and changes.

Tres Seaver did a nice job of extracting the various fixes and merging them individually. The CVS history in the public repository should clearly identify which changes were made to address which bug. I'm sure it took a lot longer to do this way, but it was the right thing.

I suppose one nice thing is the conclusion: in the course of the evaluation, very few of the Python changes in 2.3.3 turned out to directly affect the Zope security architecture or the restricted execution model. It underscores that Python 2.3 is a fairly conservative release with few compatibility issues.

I wrote a couple of web log entries while I was working on the project, but I didn't post them until the fixes were made public. I discussed how we generate custom bytecode for unpacking and how we might write a simple bytecode verifier to check that the transformations are correct.

Python versus Parrot challenge

Thursday, January 08, 2004, 10:02 p.m.

Jarno Virtanen posted a nice report about the Pie-thon, a performance contest pitting CPython against the Parrot virtual machine.

He observes that Dan Sugalski was bugged by the easy dismissal of the challenge. The Parrot-doesn't-stand-a-chance attitude. Perhaps the problem is that Dan makes everything sound easy; it's hard to get a read on what he thinks is hard and what he thinks is easy. He writes that it should be pretty trivial to get Parrot to run Python bytecode, but I don't think that's the hard part. I expect it would be hard to get an implementation of the Python runtime, new-style classes in particular, that is compatible with the Parrot internals.

Another reason for the lack of interest among Python developers is that it's hard to figure out what the state of the Parrot project is. The Parrot web site makes it look like the project is stalled. Most of the documents are old and I don't see any clear overview of the system.

Vector Resizing Algorithm

Thursday, January 15, 2004

Andrew Koenig explains why the vector class in the C++ STL grows by a factor of 1.5 when it is resized. (Link noted on Amit Patel's blog.) It took me a while to unpack the argument.

The goal is to make the resizing factor be less than (1 + sqrt(5)) / 2 -- the golden ratio. Koenig writes:

Suppose you are using a first-fit memory allocator, and you're progressively appending to a vector. Then each time you reallocate, you allocate new memory, copy the elements, then free the old memory. That leaves a gap, and it would be nice to be able to use that memory eventually. If the vector grows too rapidly, it will always be too big for the available memory.

The last sentence didn't make sense: How would the rate of change of the vector's size affect available memory? After reading the rest of the thread on comp.lang.c++.moderated, I understood that available memory meant unused memory managed by the allocator, rather than system memory that might be allocated by sbrk().

The logic seems to be that if you resize three times, on the third time the free space left by the last two copies of the vector is large enough to hold the newly resized vector. This argument assumes that a realloc() of the vector actually uses all new memory, that the memory would be allocated contiguously, and that other allocations didn't consume too much memory.
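
A tiny model makes the arithmetic concrete. This is a sketch under idealized assumptions -- the vector is the only thing being allocated, freed blocks coalesce into a single hole at the bottom of the arena, and resizing never grows the block in place:

    def first_reuse(factor, initial=16, max_steps=30):
        # Return the resize step at which the newly requested block first
        # fits into the coalesced hole left by earlier copies of the vector.
        size = initial
        hole = 0                  # free space at the bottom of the arena
        for step in range(1, max_steps + 1):
            new_size = int(size * factor)
            if new_size <= hole:
                return step       # a first-fit allocator would reuse the hole
            # Otherwise take fresh memory at the top; after the copy, the
            # old block is freed and coalesces with the hole below it.
            hole = hole + size
            size = new_size
        return None

    print first_reuse(1.5)   # 5 -- the fifth resize fits back into freed space
    print first_reuse(2.0)   # None -- each request exceeds all the holes combined

The exact step depends on the starting size and rounding, but the qualitative gap between a factor below the golden ratio and a factor of two is the point.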

I wonder if there are any first-fit allocators in common use. Doug Lea's malloc is a best-fit allocator. He notes: As shown by Wilson et al, best-fit schemes (of various kinds and approximations) tend to produce the least fragmentation on real loads compared to other general approaches such as first-fit. I suppose it doesn't hurt to choose a constant that would work well with this hypothetical first-fit allocator, given that you have to pick some constant.

P.J. Plauger offered a slightly different reason: VC++ V7.0 also uses 1.5. We settled on that factor as a better balance between memory utilization and expansion efficiency. It's harder to argue with that reason.

How about resizing algorithms in Python?

The amount a list object is resized depends on its size. For a list with n elements, when n < 2 ** (5 + 3 * i), the list size is rounded up to the nearest multiple of 2 ** (3 * i). A detailed comment in listobject.c explains:

This over-allocates proportional to the list size, making room for additional growth. The over-allocation is mild, but is enough to give linear-time amortized behavior over a long sequence of appends() in the presence of a poorly-performing system realloc() (which is a reality, e.g., across all flavors of Windows, with Win9x behavior being particularly bad -- and we've still got address space fragmentation problems on Win9x even with this scheme, although it requires much longer lists to provoke them than it used to).
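
The rounding rule itself is compact. Here is my Python transcription of the 2.3-era roundupsize() from listobject.c; treat it as a sketch of the rule rather than the authoritative code:

    def roundupsize(n):
        # Round n up to the next multiple of 2**nbits, where nbits grows
        # by 3 each time n crosses another threshold (256, 2048, 16384, ...).
        nbits = 0
        n2 = n >> 5
        while 1:
            n2 >>= 3
            nbits += 3
            if not n2:
                break
        return ((n >> nbits) + 1) << nbits

    for n in (5, 100, 300, 5000):
        print n, roundupsize(n)   # 5 -> 8, 100 -> 104, 300 -> 320, 5000 -> 5120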

A dictionary is resized whenever its fill count exceeds two-thirds of its size. (Note that the fill count is not the same as the number of keys in the dictionary, because a dictionary may contain internal dummy entries, created when a key is deleted.) A dictionary quadruples in size when it is resized, unless it has more than 50000 items, in which case it only doubles. The justification is that a sparse dictionary reduces the number of collisions and that quadrupling limits the number of resizes that occur as a dictionary grows. A dictionary resize is expensive -- much more expensive than a list resize -- because it creates a new hash table and inserts each non-dummy item from the old table into the new one.
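
Sketched in Python -- again my transcription of the 2.3-era dictobject.c logic, not the real code:

    PyDict_MINSIZE = 8

    def needs_resize(fill, size):
        # fill counts active slots plus dummy (deleted-key) slots; resize
        # once it reaches two-thirds of the table size.
        return fill * 3 >= size * 2

    def new_size(used):
        # Aim for four times the number of items (only twice once the
        # dictionary has more than 50000 items), rounded up to a power of two.
        if used > 50000:
            minused = used * 2
        else:
            minused = used * 4
        size = PyDict_MINSIZE
        while size <= minused:
            size <<= 1
        return size

    print new_size(1000)   # 4096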

A New Bug in FileStorage

Friday, January 16, 2004

We fixed a serious bug in the FileStorage pack implementation this week. If you packed twice, first to time T1 and then to an earlier time T2 (T2 < T1), you could lose data. The bug was introduced by my rewrite of pack for Zope 2.6.2. What went wrong?

There are a variety of answers. The bug itself is caused by the way FileStorage pack is implemented. It traverses the database to find all object revisions reachable as of the pack time. The traversal ignores any revisions written after the pack time; they are copied in a later phase. The unreachable revisions are then removed. A subsequent traversal using an earlier pack time is impossible: older object revisions that provided connectivity among objects were removed by the first pack. This is basically the same reason you can't undo to a packed record. It isn't possible to provide a consistent snapshot of database state from before the pack time.
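
The shape of the problem is easier to see in a toy model. This sketch uses made-up data structures, nothing like the real FileStorage format:

    def pack(storage, pack_time, root_oid=0):
        # storage maps oid -> list of (tid, refs) revisions, oldest first;
        # refs lists the oids that revision references.

        def revision_as_of(oid, t):
            # Latest revision of oid written at or before t, or None.
            best = None
            for tid, refs in storage[oid]:
                if tid <= t:
                    best = (tid, refs)
            return best

        # Mark: find the revision of every object reachable from the root
        # as of the pack time.
        keep = {}                   # oid -> tid of the revision to keep
        pending = [root_oid]
        while pending:
            oid = pending.pop()
            if oid in keep:
                continue
            rev = revision_as_of(oid, pack_time)
            if rev is None:
                continue
            keep[oid] = rev[0]
            pending.extend(rev[1])

        # Sweep: drop everything else, except revisions written after the
        # pack time; the real pack copies those in a later phase.
        for oid in storage.keys():
            storage[oid] = [(tid, refs) for tid, refs in storage[oid]
                            if tid > pack_time or tid == keep.get(oid)]

After pack(storage, T1) the model has discarded every pre-T1 revision except the ones reachable at T1, so a later pack(storage, T2) with T2 < T1 can no longer find the older revisions it needs to rebuild reachability. That is the hole the real implementation fell into.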

Before I fixed the problem, I wrote a simple test case, involving three objects, that reproduced the data loss that was originally reported. It's important to understand the source of the bug completely before you try to fix it. If you don't understand the cause, the fix probably won't work either. And sometimes you'll find a couple more bugs while trying to work out what went wrong.

The previous implementation of pack had an explicit test for this case. Jim Fulton had discovered and addressed the problem in the early days of FileStorage. When I talked to him he said, "I fixed that bug." So it's a real subtlety of pack that had eluded him the first time around, too.

Tom Van Vleck wrote a nice article about fixing bugs, Three Questions about Each Bug You Find. The three questions are about addressing the failure -- both in the software and in the process. For this particular case, the third question is the most interesting.

1. Is this mistake somewhere else also?

With the test case in hand, I confirmed that the BerkeleyDB-based storages don't have this problem. They use a very different garbage collection scheme, so you can pack to an earlier time if you want; it won't have any effect, good or bad.

The GC traversal isn't performed in any other part of the storage. The undo implementation already handles previous packs. It checks for a transaction record with status 'p' (for pack) and stops if it finds one.

I don't think this issue comes up anywhere else.

2. What next bug is hidden behind this one?

There shouldn't be other bugs hiding behind this one. We were allowing insane packs to run. The fix is to raise an exception if you attempt to pack to a time earlier than a previous pack time. We're strictly reducing the number of possible program executions.

3. What should I do to prevent bugs like this?

The unit tests for pack probably don't cover a lot of edge cases. It's difficult to write a pack test, because you need to set up a network of objects with just the right set of references and revisions. I wrote several new tests before I started on the new pack implementation; they uncovered a few problems in the Berkeley storage pack.

We need to work harder on black-box testing. From the API alone, we could have asked: "What happens when you pack a second time with an earlier pack time?" On the other hand, it wouldn't be a bug even for FileStorage, unless you have the right history of object modifications and previous packs.

There aren't any good design documents about FileStorage, not even notes on the implementation. I think this is a design bug: if pack revolves around traversal of historical revisions, we should have worked out how earlier packs affect that traversal. The new pack implementation is easier to read than the old one, but we didn't think to write down notes about the design issues. We should.

We need to manage bug reports better. I didn't find out about the problem until Toby Dickenson reported data loss on the zodb-dev list. In later discussion, Dieter Maurer mentioned that several problems had been reported on a Plone list. Eventually, we found a bug report and diagnosis sitting in the Zope Collector, from way back in November. We aren't doing a good enough job of connecting bug reports and developers. There may be many causes; e.g. users don't know where to send bug reports.

Rewrite from scratch

There was an article just the other day on Slashdot titled Rewrites considered harmful?. I didn't read the whole article, but it looked to be about contemporary examples of the second-system effect. It also had a link to an interesting article by Joel Spolsky on the worst strategic mistake a software company can make: rewriting the code from scratch.

Perhaps the rewrite of FileStorage was a mistake? Those two articles are really talking about software strategy and I'm talking about tactics. FileStorage is a small component of Zope and pack is just one operation, albeit the largest and most complex in FileStorage. The rewrite was a big improvement in a lot of ways, but I hadn't tracked down all the edge cases in the design.

One part of Spolsky's article bugged me, though, and it's related to the redundant-pack bug. We had fixed this bug before, years earlier. Arguing against rewrites, Spolsky says:

Each of these bugs took weeks of real-world usage before they were found. The programmer might have spent a couple of days reproducing the bug in the lab and fixing it. If it's like a lot of bugs, the fix might be one line of code, or it might even be a couple of characters, but a lot of work and time went into those two characters.

When you throw away code and start from scratch, you are throwing away all that knowledge. All those collected bug fixes. Years of programming work.

The problem is that all that knowledge and the hours or days spent tracking down a bug shouldn't be captured in just two characters of source code. If the only artifact of software engineering that you have is source code, you are in trouble. There ought to be a test case, comments, design and implementation notes, revision history. If someone learned a lot when they fixed the bug, they ought to make sure that knowledge gets captured effectively.

Zope 2.8 Plans

Saturday, January 17, 2004

Jim Fulton posted an updated Zope 2.8 release schedule. He hopes to have a release in March 2004. That would be a record-setting schedule for Zope, since we haven't released 2.7 final yet.

Plone Interview

Sunday, January 18, 2004

O'Reilly has an interview with Alexander Limi and Alan Runyan, the Plone guys. Plone came out on top of O'Reilly's open-source-at-Comdex popularity contest.

Simple Python Aggregator

Thursday, January 22, 2004

I've been working on a very simple Python RSS aggregator as a way to learn more about RSS and to see what kinds of data you find in real life. I decided to package up the code and release it -- all 500 lines of it. It's sagg 0.1, a simple aggregator.

The simple aggregator (sagg) fetches a collection of RSS feeds and produces a simple HTML page containing all the entries sorted by time. It's a small package -- less than 500 lines of code -- intended to demonstrate some basic techniques.

I used Fredrik Lundh's ElementTree package so that I wouldn't have to deal with any low-level XML issues. ElementTree, in turn, uses Python's builtin XML parser. ElementTree has the cleanest API of any of the Python-XML bindings I've seen.
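
A minimal fragment of the approach might look like the sketch below. This is not sagg's actual code; the feed URL is hypothetical, and the 2004-era package was imported as elementtree.ElementTree (it later moved into the standard library as xml.etree.ElementTree):

    import urllib
    from elementtree import ElementTree

    def parse_rss(url):
        # Return (pubDate, title, link) tuples for a plain RSS 2.0 feed.
        tree = ElementTree.parse(urllib.urlopen(url))
        entries = []
        for item in tree.findall('channel/item'):
            entries.append((item.findtext('pubDate'),
                            item.findtext('title'),
                            item.findtext('link')))
        return entries

    entries = []
    for url in ['http://example.org/rss.xml']:    # hypothetical feed URL
        entries.extend(parse_rss(url))
    entries.sort()   # crude: sorts the raw date strings, not parsed dates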

Sagg is very similar to Spycyroll, the former PyBlagg, started by Vattekkat Satheesh Babu. The chief difference is that Spycyroll uses Mark Pilgrim's Ultra-liberal feed parser and Sagg uses its own ElementTree-based parser. Sagg's parser handles fewer feeds, but it's much less code. Pilgrim's parser is about three times bigger than Sagg itself.

I tested the ultra-liberal parser against my ET parser. The results aren't as bad as I feared. I collected 2782 feeds from syndic8. The Expat-based parser choked on 276 of them, about 10 percent.

Mailman bounce action

Saturday, January 24, 2004

Mailman has been disabling my list subscriptions because it is sending me spam. The mail servers at alum.mit.edu recently installed a spam filter, so Mailman receives SMTP errors when it tries to deliver spam that comes through a list. After a certain number of errors, Mailman decides my address is undeliverable and disables delivery to it.

I spoke briefly with Barry, and he agreed that Mailman ought to have better heuristics for bounce handling. He mentioned that one project he works on sends a probe message after a certain number of bounces. If the probe fails, then it disables delivery. It seems like another heuristic might have to do with the percentage of messages rejected. For most of the lists, there is more ham than spam and the ham all gets through. For low-traffic lists like meta-sig, it may not work; there could be more spam than ham on the list. (I've got my own spam filter, so I wouldn't know.)

Why does Mailman bother disabling accounts for bounces? As long as Mailman prevents the bounces from getting delivered to the list administrator, who cares? I don't actually think Mailman should deliver to all addresses regardless of whether they bounce, but it's an interesting question. In the extreme, Mailman could produce a large volume of undeliverable email that wastes resources on both ends. But what if it had a very conservative policy, say, disable an account only if it bounced every message in the last month or the last 100 messages (whichever is longer)? How wasteful would that be? Considering the great volume of mail through python.org on a daily basis -- about 900MB of email delivered on a typical weekday -- the bounces are probably a small fraction of the problem.
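
For what it's worth, the conservative policy is easy to state in code. A sketch of the heuristic described above, purely hypothetical and nothing like Mailman's actual bounce processor:

    import time

    def should_disable(history, now=None):
        # history holds one (timestamp, bounced) pair per message sent to
        # the address, oldest first.  Disable only if every message in the
        # last month, or the last 100 messages, whichever covers more
        # traffic, bounced.
        if now is None:
            now = time.time()
        month_ago = now - 30 * 24 * 60 * 60
        recent = [b for (sent, b) in history if sent >= month_ago]
        window = max(len(recent), 100)
        sample = history[-window:]
        if len(sample) < window:
            return False            # not enough history to be sure
        for sent, bounced in sample:
            if not bounced:
                return False        # at least one delivery succeeded
        return True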

The Mailman vs. spam filters dilemma is a real one, though. If an ISP provides spam filtering, what can the typical user do? I'm surprised that I haven't heard about the problem before. Perhaps most ISPs mark messages as spam but don't actually reject them.

Lemelson Patents Are Invalid

Monday, January 26, 2004

Several of Jerome Lemelson's patents on machine vision were invalidated by a federal judge in Las Vegas today. Lemelson described himself as a prolific inventor, but the recent ruling is further evidence that he was just a prolific writer of generic patent applications. He did earn more than a billion dollars in royalties for his submarine patents and even got MIT's stamp of approval by donating $6.5 million for an annual inventor's prize.

Lemelson's bio on the MIT site says:

Lemelson's breakthrough invention, and the one for which he was most proud, was a universal robot that could measure, weld, rivet, transport and inspect for quality control -- capabilities made possible by machine vision, a new technology that used computers to analyze digitized images captured from a video camera.... Today this concept, machine vision, is used by automotive and electronic companies the world over for automated precision manufacturing.

It is exactly these patents that the judge declared invalid and unenforceable. Lemelson has collected more than $1.5 billion in license fees for these and other patents. With that kind of money at stake, I can only assume that decision will be appealed.

A recent article in IEEE Spectrum offers a different and more compelling view:

His detractors attribute much of his success to the use of Byzantine tactics for exploiting loopholes in the patent system. Even Arthur Lieberman, his former attorney, believes he simply had a knack for figuring out where an industry was headed, and then claiming that he had already been there.

Later it elaborates:

[Pre-1995 patent rules] allowed clever applicants like Lemelson (with the aid of their patent attorneys) to secretly postpone, amend, and supplement their patent applications with generic terminology until they covered commercially successful products that came into existence many years after the patent application process was initiated.

When we covered the MIT-Lemelson prize at The Tech, we didn't know much about Lemelson beyond what the News Office was telling us. We should have asked why we had never heard of someone who claimed to have invented key components of camcorders, compact disk players, and fax machines.

Looking back on the original article, it's interesting to see that none of the MIT people are quoted saying anything about Lemelson's accomplishments. Glen L. Urban, then the Dean of the Sloan School of Management, said the money would be very good for Sloan and offered that Lemelson "has a grand vision here of trying to make invention as salient as being an athletic hero to young people."