
Jeremy Hylton's Web Log, May 2003

Attribute Access Microbenchmark

Tuesday, May 06, 2003

I tried out the new timeit module today. How long does it take to access an attribute on a class instance? I tried three variations: a classic class, a new-style class, and a new-style class with slots.

The class had 16 attributes named "a1" .. "a8" and "b1" .. "b8". The timeit test statement was just "obj.b1". The three results are:

Regular new-style class 0.24
New-style class w/slots 0.18
Classic class 0.16

Measurement: 200,000 iterations on my laptop.
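
The original timing script isn't shown here, but a minimal sketch of the setup might look like the following. The class names and attribute values are illustrative, and the classic/new-style distinction only exists under Python 2:

    # Minimal sketch of the benchmark, assuming this is roughly what was timed.
    # Run as a script so "from __main__ import obj" works in the timer setup.
    import timeit

    names = ["a%d" % i for i in range(1, 9)] + ["b%d" % i for i in range(1, 9)]

    class Classic:                  # classic class (Python 2 only)
        pass

    class NewStyle(object):         # new-style class
        pass

    class Slotted(object):          # new-style class with slots
        __slots__ = names

    def make(cls):
        obj = cls()
        for name in names:
            setattr(obj, name, 1)   # fill in a1..a8 and b1..b8
        return obj

    for label, cls in [("classic", Classic),
                       ("new-style", NewStyle),
                       ("slots", Slotted)]:
        obj = make(cls)
        timer = timeit.Timer("obj.b1", "from __main__ import obj")
        print("%-12s %.2f" % (label, timer.timeit(number=200000)))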

Conclusion: Attribute lookup on a new-style class is 50% slower than on a classic class. The use of slots speeds up lookup, bringing it closer to the classic class speed.

What if we add inheritance to the mix? I created subclasses of the original classes. The classic and new-style class had no body (just a pass). The slot-style class said __slots__ = "a1", just because it needed to declare that it used slots. Then I repeated the same steps, so I could test with one base class and two.
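
Continuing the sketch above, the subclasses might look roughly like this (again illustrative, not the original code):

    class ClassicSub(Classic):       # one base class
        pass

    class NewStyleSub(NewStyle):
        pass

    class SlottedSub(Slotted):
        __slots__ = "a1"             # only to declare that the subclass uses slots

    # The two-base-class case repeats the pattern one level deeper:
    class SlottedSub2(SlottedSub):
        __slots__ = "a1"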

                         1 base class  2 base classes
Regular new-style class  0.26          0.29
New-style class w/slots  0.25          0.31
Classic class            0.16          0.16

The new-style class is a little slower, but the slot-style class is a lot slower. The slot-style class is now slower than the new-style class without slots. The classic class stayed the same. It looks like slot-style classes continue to get worse as the inheritance hierarchy gets deeper. The exact behavior depends on how many base classes must be traversed before the slot descriptor is found.

Consistency in ZODB

Wednesday, May 14, 2003

ZODB uses optimistic concurrency control. It can be hard to describe what isolation guarantees it provides, because optimistic implementations don't map naturally onto the ANSI isolation levels.

Atul Adya wrote a thesis that describes isolation levels using directed graphs. A node in the graph is a transaction, and an edge is a dependency between two transactions. These definitions make sense for optimistic and multi-version schemes as well as for traditional locking implementations.

I've made some notes about Adya's thesis and about how its ideas apply to ZODB. I had hoped to understand how the formalisms apply to ZODB, but I'll need to spend some more time on it.

Adya's thesis is available online -- Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions -- and so is a shorter conference paper that presents that basic graph notion -- Generalized Isolation Level Definitions.

There are three kinds of direct dependencies between transactions. (See section 3.1.4 of Adya's thesis.)

write-depends
Ti installs a version of x and Tj installs the next version of x, where "install" means the transaction writes x and commits.

read-depends
Ti installs xi and Tj reads xi.

anti-depends
Ti reads xi and Tj installs xj, where xj is the next version of x after xi.

ZODB reports two kinds of conflict errors: read conflicts and write conflicts. A read conflict occurs while a transaction is executing; a write conflict occurs while a transaction is committing. ZODB uses backward validation, which means that a committing transaction is checked for conflicts against transactions that committed while it was running.

When a transaction commits, each object it wrote is checked for conflicts. If the revision of the object the transaction read is not the same as the current revision in the database, a ConflictError is raised. This means another transaction installed a new version of the object while the current transaction was running. The explanation in Adya's terminology goes like this:
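
A rough sketch of that commit-time check follows; the names (txn.written, db.current_serial) are hypothetical, not the actual ZODB internals:

    class ConflictError(Exception):
        pass

    def validate_writes(txn, db):
        # Backward validation: for every object the transaction wrote, make
        # sure the revision it read is still the current revision.
        for oid, read_serial in txn.written.items():
            if db.current_serial(oid) != read_serial:
                # Another transaction installed a new revision of this object
                # while the current transaction was running.
                raise ConflictError(oid)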

There are three transactions to consider: T1, which writes object x and commits; T2, which reads x and later writes it; and T3, which also reads x, then writes it and commits before T2 does.

The history of activities would be something like this:

w1 c1 r2 r3 w3 c3 w2 c2
(w -- write, r -- read, c -- commit).

This history is invalid, and ZODB will report a conflict. There are several dependencies present in this history:

  1. T2 directly read-depends on T1
  2. T3 directly read-depends on T1
  3. T3 directly anti-depends on T2
  4. T3 directly write-depends on T1
  5. T2 directly write-depends on T3

The resulting graph has a cycle between T2 and T3, formed by edges #3 and #5. T2 reads the object and T3 installs the next version, creating an anti-dependency (#3). But if T2 were to commit, it would create a write-dependency in the other direction (#5).
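
To make the cycle concrete, here is a toy direct serialization graph for this history. The edge list is hand-derived from the dependencies above, and the cycle check is plain depth-first search; this is an illustration, not ZODB code:

    # Edges run from the transaction depended on to the depending transaction.
    edges = {
        "T1": ["T2", "T3"],   # T2 and T3 read-depend (and T3 write-depends) on T1
        "T2": ["T3"],         # edge #3: T3 anti-depends on T2
        "T3": ["T2"],         # edge #5: T2 write-depends on T3 (if T2 commits)
    }

    def has_cycle(graph):
        visiting, done = set(), set()
        def visit(node):
            if node in visiting:
                return True            # back edge: a cycle
            if node in done:
                return False
            visiting.add(node)
            if any(visit(succ) for succ in graph.get(node, ())):
                return True
            visiting.remove(node)
            done.add(node)
            return False
        return any(visit(node) for node in graph)

    print(has_cycle(edges))            # True: the history is not serializable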

One significant drawback of ZODB's consistency mechanism is that reads are not tracked explicitly. If an object is already loaded in memory when a transaction starts, it is possible to see an inconsistent view of the database. The reason appears to be efficiency: persistent objects provide no read notification to their database connection, so the database has no idea what objects a transaction reads unless it is asked to fetch them from storage. When a transaction starts, any objects modified by earlier transactions are flushed from the cache, so the window of vulnerability is at most the length of a transaction.

It's hard to say what isolation level ZODB provides given this problem with reading stale data. It certainly provides PL-2 isolation (basically equivalent to ANSI read committed), since PL-2 allows inconsistent reads. Adya presents PL-2+ as the weakest level that provides basic consistency (originally defined by Weihl), provided that updates are serializable.

Definition 11 : Basic-Consistency. A transaction Tj is provided basic-consistency if the values read by Tj are the result of a serial execution of some subset of committed update transactions and each update transaction in the serial execution executes the same steps as it did in the concurrent execution.

In practice, this means that a programmer could rely on PL-2+ if a transaction was read-only or if its writes did not affect consistency because, for example, the objects it writes are not involved in any consistency constraints.

Adya distinguishes between consistency at commit-time and at execution-time. Commit-time consistency means that an invalid transaction is not allowed to commit. Execution-time consistency is just as important; it means that an executing transaction is not allowed to see an inconsistent state, even if it will eventually abort.

Execution-time consistency is important for data structures that involve constraints among multiple objects. One BTrees bug we chased down was caused by a failure of execution-time consistency. One transaction added enough keys to a BTree bucket to cause it to split. A bucket split modifies two objects: in addition to the bucket itself, the parent is modified to add a new key for the new bucket. In the buggy case, a second transaction started and saw the modification to the bucket but not to the parent. As a result, it added new keys to the wrong bucket: because it did not see the new key in the parent, it put keys that belonged in the new bucket into the old bucket.
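
A toy version of the invariant, just to show why seeing only one of the two writes is fatal. This is a simplification with made-up keys; the real BTrees are C data structures:

    # The parent routes keys to buckets: keys < 10 go to B1, keys >= 10 to B2.
    parent = {"keys": [10], "buckets": ["B1", "B2"]}
    buckets = {"B1": [1, 3, 7], "B2": [10, 12, 15, 18]}

    # A split of B2 writes *two* objects: the new bucket B3, and the parent,
    # which gains the routing key 15.
    buckets["B2"] = [10, 12]
    buckets["B3"] = [15, 18]
    new_parent = {"keys": [10, 15], "buckets": ["B1", "B2", "B3"]}

    def route(p, key):
        # Find the bucket whose key range contains `key`.
        for i, k in enumerate(p["keys"]):
            if key < k:
                return p["buckets"][i]
        return p["buckets"][-1]

    print(route(new_parent, 16))   # 'B3': the consistent view
    print(route(parent, 16))       # 'B2': the stale parent routes key 16 to the
                                   # old bucket, the inconsistency described above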

Several years ago, Andrew Kuchling asked if ZODB supports ACID:

I don't really see how the ZODB supports Consistency, since there's no mechanism for specifying integrity constraints within ZODB itself, though you can certainly write such checks into your classes. For example, you can't enforce that the .actor attribute is a User object that must be present in a given list.

I think he misunderstands what consistency means for ACID transactions. The BTrees example illustrates how database consistency is necessary to be able to program certain kinds of applications. Features like referential integrity and type-checking can be helpful, but they are different from consistency and isolation.

ZODB provides execution-time consistency by raising read conflict errors. A simple explanation of read conflict errors requires some extra details about how ZODB works. ZODB transactions execute in parallel. When one connection commits a transaction, it sends an invalidation message to all the other connections; the invalidation message lists all the objects that were modified by the committed transaction. The next time another connection commits a transaction, it will apply the invalidation; i.e., it will ghostify all the objects in the invalidation message.

If a connection is asked to load an object from the database and it has received (but not applied) an invalidation message for the object, it raises a ReadConflictError. This error is necessary to avoid an inconsistent read. When ZODB loads an object from storage, it always reads the most recent revision. Since the connection has an invalidation message for the object, the object was modified after the transaction began. If it reads the current revision of that object, it could be inconsistent with other objects read before the transaction began.
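
A minimal sketch of that check, with illustrative names rather than the real Connection code:

    class ReadConflictError(Exception):
        pass

    class Connection:
        def __init__(self, storage):
            self._storage = storage
            self._pending = set()     # oids invalidated but not yet applied

        def invalidate(self, oids):
            # Called when another connection commits a transaction that
            # modified these objects; applied at our next transaction boundary.
            self._pending.update(oids)

        def load(self, oid):
            if oid in self._pending:
                # The object changed after our transaction began; loading the
                # current revision could be inconsistent with earlier reads.
                raise ReadConflictError(oid)
            return self._storage.load(oid)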

I believe a read conflict would be allowed by PL-2 but not by PL-2+. The cycle in the graph for a read conflict involves a read dependency and an anti-dependency. See the example on p. 68 of the thesis. The example involves two objects with the constraint that x + y >= 0. The Hbroken history could raise a ReadConflictError in ZODB, unless the read object was already in the cache.

If ZODB were modified so that ReadConflictError were possible when reading objects from the cache, then it would clearly provide PL-2+. It may also provide enough of a hook to provide full serializability for transactions that ask for it. The example transaction in the PL-3 section (3.2.3) would not work in ZODB; a conflict error would occur. I should try to understand what phenomena ZODB does allow, described in terms of a serializability graph. It may be that PL-3 is also achievable at relatively low cost. One possibility is to keep the read set locally and compare it to the invalidation set when a transaction commits. I think this would work unless an invalidation message was in transit at commit time.
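
A sketch of that commit-time check on the read set; this is hypothetical, not current ZODB behavior:

    class ConflictError(Exception):
        pass

    def validate_reads(read_oids, invalidated_oids):
        # At commit, compare the locally tracked read set against the set of
        # objects invalidated since the transaction began.
        stale = read_oids & invalidated_oids
        if stale:
            raise ConflictError("read objects were modified concurrently: %r"
                                % sorted(stale))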

Hackers, Writers, and Extreme Programming

Tuesday, May 20, 2003

Paul Graham's essay Hackers and Painters prompted a long thread on the ll1-discuss mailing list.

I think Paul has a good argument, but I disagree on some of the particulars. The crux of his argument is that programmers are makers, like writers and painters. In particular, Paul's model of painting seems to be the solitary-genius model, and that leads him to question whether agile programming makes any sense. (The points on which we disagree may have something to do with differences between painting and writing.)

Paul wrote:

I've heard of this agile programming thing. (Surely "agile" isn't the best word, btw; surely one hacker is at least more *agile* than two.) There may well be cases where several people can work on code. I don't think I could, though. It's fine to have code reviews. Such code reviews could even take the form of presenting someone with a rewritten version for them to adopt if they chose. But I don't think I could stand more than that. Nor would I presume to do more than that to code someone else owned.

Guy L. Steele, Jr. offered some positive experience with pair programming, although I don't know if it extends to common code ownership:

FWIW, the early Scheme interpreters were the result of pair programming (Sussman and I working together, and often trading places at the keyboard). The early Scheme papers were the result of pair writing, where if a paragraph wasn't flowing smoothly from one guy's fingers then the other would jump in with a reformulated sentence.

I don't know if I do much extreme programming -- a particularly hyped agile methodology -- but its basic practices make a lot of sense to me. They make sense because they are similar to the way I know how to write and because they seem to promote readable programs.

I wrote for my college newspaper, and, in that medium, I found writing to be a collaborative activity. Every story was edited by at least two people, and we usually wrote headlines in pairs. I remember a few productive writing sessions where I shared the keyboard with an editor. Even when I was working by myself, I found it helpful to imagine a conversation between me and a reader.

Paul says that if there is one quote about programming people should know, it is this one from the original preface to Structure and Interpretation of Computer Programs:

Programs must be written for people to read, and only incidentally for machines to execute.

In several articles, Paul has argued the easiest program to read is the shortest one. He has a point, but only if you don't push it too far. I think Strunk and White got it right:

Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts. This requires not that the writer make all his sentences short, or that he avoid all detail and treat his subjects only in outline, but that every word tell.

The shortest possible program, however, does not seem like the clearest program in all cases. A short program can easily become a cryptic program that is a puzzle to decipher. So clarity should be the goal rather than brevity. I think comments can do a lot to increase clarity; for example, explaining what variables are covered by a lock or condition variable. If the program is correct, you can figure that out by reading the code, so the information is redundant. But it sure helps to have it written down somewhere. (You may be able to figure it out even when the code is incorrect, as long as it is usually correct.)

It becomes more important to have readable code when you are working in a group. Any sufficiently long-lived project will have many people reading the code, and they need to share some common understanding of what the code does and some common appreciation of good code. (Paul is right on this count, too: taste counts.)

The criticism of common ownership of code on ll1-discuss was that having multiple owners violates the conceptual integrity of the code. Paul again:

When a piece of code is being hacked by three or four different people, no one of whom really owns it, it will end up being like a common-room. It will tend to feel bleak and abandoned, and accumulate cruft.

I think many of the XP practices are intended to avoid the accumulation of cruft. Pair programming, refactoring, collective ownership, and coding standards all mean that everyone is responsible for fixing bad code.

The other argument is that a system with a single author is more likely to have conceptual integrity. This seems more a complaint about the XP design process -- that having many people modifying the code without serious up-front design is a bad idea. I admit to being ambivalent about XP's approach to design, but I haven't done a project using all the XP practices.

Most of the projects I work on favor some amount of upfront design. The Python PEP process recommends drafting the specification part of the PEP before starting the implementation. Perhaps this would be considered lightweight design because it is fairly informal and it is expected to change once implementation begins.

On the other hand, there's another great quote in SICP that seems an excellent match for XP practices. You often don't know what exactly you're trying to build when you start the design process. So it would be foolish to spend too much time working out the design. Build a little first, then see what you learn from the exercise.

SICP acknowledgments:

Marvin Minsky and Seymour Papert formed many of our attitudes about programming and its place in our intellectual lives. To them we owe the understanding that computation provides a means of expression for exploring ideas that would otherwise be too complex to deal with precisely. They emphasize that a student's ability to write and modify programs provides a powerful medium in which exploring becomes a natural activity.