Jeremy Hylton : weblog : September 2003 last modified Thu Mar 17 01:11:16 2005

Jeremy Hylton's Web Log, September 2003

Python in Byte

Monday, September 01, 2003

The article that Cameron Laird, Alex Martelli, and I wrote for Byte was published today. It hits many of the highlights of the Python 2.3 release. There weren't a lot of big changes, but the sheer mass of small changes adds up.

I have intended to write more about Python for several years now. I'm pleased to have finally done something, but those muscles are a little sore. It's been a long time since I've written regularly.

New Blog Software

Monday, September 08, 2003

I've got some very simple new software for generating my web log pages. In June, I realized that I didn't like the software I was using for my web log. (See "Your blog software sucks.")

It is the simplest thing that could possibly work. I'm writing the actual entries as HTML fragments, just like an ht2html page. I have to re-integrate my RSS feed generator. It should be much simpler than the last one, because I don't need to scrape the BlogMax output.

Scaling to 100,000 Threads

Friday, September 12, 2003

A paper at SOSP this year is the latest entry in the long-standing debate about threads versus events. I have always tended to favor the threaded view, primarily because the event-based approach obscures the static control flow of the program.

The paper, by von Behren et al., is about Capriccio, a scalable threads package for C programming. The basic argument is that several of the problems associated with the threaded approach have a lot to do with implementation issues, e.g. the amount of memory used for each thread's stack.

The Capriccio package is a user-level thread package for Linux that scales to 100,000 threads. Many of the implementation challenges are the result of providing a thread package for C programmers. They perform elaborate compile-time analysis to determine how much stack space is needed and insert checks for dynamic resizing. It is also a challenge to avoid blocking IO operations, which would block the entire process.

An earlier HotOS paper about Capriccio engages more in the debate on threads vs events. It's a nice companion to the implementation paper. The best argument against events is the "stack-ripping" problem (a term coined by Adya et al.). At each point that a blocking call could occur, you need to rip apart the stack and create a closure invoked via callback. I dislike this because it obscures the intent of the program.
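To make the stack-ripping complaint concrete, here is a toy sketch in modern Python. Everything in it (EventLoop, threaded_style, event_style, and the two-step read) is hypothetical and invented for illustration; it is not taken from either paper. The same two-read task is written once in straight-line style and once ripped apart into callbacks.

```python
from collections import deque


class EventLoop:
    """Toy event loop: a FIFO of callbacks. Hypothetical, for illustration only."""

    def __init__(self):
        self._pending = deque()

    def call_soon(self, callback, *args):
        self._pending.append((callback, args))

    def run(self):
        while self._pending:
            callback, args = self._pending.popleft()
            callback(*args)


def threaded_style(read, write):
    # Control flow reads top to bottom; 'header' simply lives on the stack.
    header = read()
    body = read()
    write(header + ":" + body)


def event_style(async_read, write):
    # The same logic, ripped at each point where a read could block.
    def on_header(header):
        # 'header' survives only because on_body closes over it.
        def on_body(body):
            write(header + ":" + body)
        async_read(on_body)
    async_read(on_header)


def demo():
    out = []
    source = iter(["GET", "/index.html"])
    threaded_style(lambda: next(source), out.append)

    loop = EventLoop()
    source = iter(["GET", "/index.html"])

    def async_read(callback):
        # Deliver the next chunk later, from the event loop.
        loop.call_soon(callback, next(source))

    event_style(async_read, out.append)
    loop.run()
    return out
```

Both versions produce the same output, but in the event version the order of operations has to be reconstructed from the nesting of the closures.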

Many higher-level languages rely on user-level threads. Erlang achieves amazing scalability for its threads package. PLT Scheme also uses user-level threads, though I don't know anything about scalability.

I need to understand whether Stackless Python provides a similar model. I haven't followed the Stackless 3.0 work closely, so I don't know the state of the threads package. Christian Tismer's slides for EuroPython suggest that you can create many threads, but they aren't intended for blocking IO.

I think a Python threads package for scalable internet services is possible and would be a great project. In Python, the stack management isn't so much of an issue because all the Python frames are heap allocated. Stackless, of course, contends with the places where Python still uses the C stack. (I wonder if you could create a variant Python frame for C functions; the frame would contain a pointer to the C function but could be allocated on the heap.) You would need to provide IO operations, e.g. sockets and threads, that presented a traditional blocking interface implemented on top of nonblocking primitives like poll and AIO. Unfortunately, I'm not aware of any papers that get into the details of these issues.
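As a rough sketch of what "a traditional blocking interface implemented on top of nonblocking primitives" could look like, the code below uses generators and the selectors module from modern Python, neither of which was available in this form in Python 2.3. The Scheduler class and the recv_all and send_all helpers are hypothetical names; this is not how Capriccio or Stackless works, only an illustration of the idea under those assumptions.

```python
import selectors
import socket
from collections import deque


class Scheduler:
    """Runs generator-based tasks; a task yields (event, sock) to wait for IO."""

    def __init__(self):
        self.ready = deque()
        self.selector = selectors.DefaultSelector()
        self.waiting = 0

    def spawn(self, task):
        self.ready.append(task)

    def run(self):
        while self.ready or self.waiting:
            while self.ready:
                task = self.ready.popleft()
                try:
                    event, sock = next(task)
                except StopIteration:
                    continue  # task finished
                # Park the task until its socket is ready.
                self.selector.register(sock, event, task)
                self.waiting += 1
            if self.waiting:
                for key, _ in self.selector.select():
                    self.selector.unregister(key.fileobj)
                    self.waiting -= 1
                    self.ready.append(key.data)
        self.selector.close()


def recv_all(sock, nbytes):
    """Reads like a blocking call to its caller, but only suspends one task."""
    chunks = []
    while nbytes > 0:
        yield selectors.EVENT_READ, sock
        data = sock.recv(nbytes)
        if not data:
            break
        chunks.append(data)
        nbytes -= len(data)
    return b"".join(chunks)


def send_all(sock, payload):
    while payload:
        yield selectors.EVENT_WRITE, sock
        sent = sock.send(payload)
        payload = payload[sent:]


def echo_server(sock, log):
    request = yield from recv_all(sock, 5)
    log.append(("server got", request))
    yield from send_all(sock, request.upper())


def client(sock, log):
    yield from send_all(sock, b"hello")
    reply = yield from recv_all(sock, 5)
    log.append(("client got", reply))


def demo():
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    log = []
    scheduler = Scheduler()
    scheduler.spawn(echo_server(a, log))
    scheduler.spawn(client(b, log))
    scheduler.run()
    a.close()
    b.close()
    return log
```

The task bodies keep their sequential shape; only the scheduler knows that poll-style readiness notification is doing the work underneath.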

Stackless Python is currently pursuing a different model for concurrency based on tasklets. I'd like to find out more about them. I suspect they're related to Erlang's four concurrency primitives.

One problem with the current Capriccio implementation is that it doesn't take advantage of multiple processors. It's far more difficult to map M user-level threads onto N kernel threads. (Apparently the events camp has worked on this problem, but I haven't read the recent Zeldovich paper yet.) In a Python implementation, the problem might be worse because of the global interpreter lock.

Papers mentioned

Atul Adya, Jon Howell, Marvin Theimer, Bill Bolosky, John Douceur. Cooperative Task Management Without Manual Stack Management, or Event-driven Programming is not the Opposite of Threaded Programming. In Proceedings of the USENIX Annual Technical Conference, June 2002.

Rob von Behren, Jeremy Condit and Eric Brewer. Why Events Are A Bad Idea (for high-concurrency servers). In Proceedings of the 9th Workshop on Hot Topics in Operating Systems, 2003.

Rob von Behren, Jeremy Condit, Feng Zhou, George C. Necula, and Eric Brewer. Capriccio: Scalable Threads for Internet Services. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003.

John K. Ousterhout. Why Threads Are A Bad Idea (for most purposes). Presentation given at the 1996 Usenix Annual Technical Conference, January 1996.

Nickolai Zeldovich, Alexander Yip, Frank Dabek, Robert T. Morris, David Mazieres, Frans Kaashoek, Multiprocessor Support for Event-Driven Programs, USENIX 2003 Annual Technical Conference, June 2003.

A Plan for Improving Web Programming in Python

Wednesday, September 17, 2003

Bill Janssen will champion a Python Web SIG to improve standard library support for Web programming. This work is long overdue.

When the web was still a new thing, Python had pretty good libraries for web programming. When I started using Python in 1996, httplib and urllib seemed like the state of the art for web programming. Not any more. I've had many serious Python developers tell me that they prefer PHP for basic web programming.

There are great web frameworks written in Python, like Zope and Quixote. The frameworks are good for building dynamic web sites or serious content management, but that's a narrow part of web programming.

urllib2 was designed to address a small corner of the problem, writing clients that fetch and process web pages. I finished the first draft just before we left CNRI and could never make time to work on it more. John J. Lee has a lot of good ideas about making it really useful. He's done work like ClientCookie, which provides really necessary cookie support.

I'm hopeful that good things will happen.

Dr. Tara Gilligan

Thursday, September 18, 2003

My wife defended her dissertation today. Hooray!

The title is Constructing a Moral Life: Literature and the Ordinary Moral Agent First Reader. Her advisers were George Wilson at UC Davis and Richard Bett at Hopkins.

She does not have a web page of her own, but there's a very brief bio page at Hopkins.

We drove to and from Baltimore today, dodging Hurricane Isabel. Hopkins was closed because of the hurricane, but the defense went on as scheduled. The hurricane did not cause much disruption. A large tree fell over during the defense and made a big crash. After the defense, Jerry Schneewind had everyone over for drinks; the power went out there shortly after we arrived, so we relied on candle light.

Python 2.3.1 released

Tuesday, September 23, 2003

The Python 2.3.1 release this evening is a big milestone in Python development. Guido did not have any involvement in planning the release, and Tim did not build the Windows installer. Anthony Baxter and Thomas Heller did all the work.

It is good to see the larger developer community get involved in the release process. PythonLabs used to take the primary responsibility for these releases, but then we were getting paid to do it. It looks like Python will get along fine without us.

The new release does not have a lot of bug fixes. You can take that two ways. On the one hand, it means that the original 2.3 release was pretty stable. On the other hand, it means few people beyond Raymond Hettinger and Martin von Löwis have a lot of time for bug fixing. At last report, there were 475 open bugs and 193 open patches. That backlog has grown steadily as the number of people paid to work on Python has declined.

ZODB 3.2 release, too

We also released ZODB3.2b3 yesterday. I think ZODB 3.2 will serve everyone well: ZConfig is a good configuration language. And the new daemon management code that Guido wrote should be simpler to use. (I wish it had more testing.)

This release was delayed far longer than I expected. The previous beta was in July and the original plan was for a final release in spring 2003. We got interrupted by a lot of customer work and shifting priorities. In general, we need to be better about devoting attention to a release during the beta phase. We tend to get a beta out and then find other things to work on. In the long run, that just creates more work for everyone. Customers would like the latest bug fixes, but they don't want to run a beta. So developers spend lots of time porting fixes between various released and beta versions.

Python Package Index Tutorial

Wednesday, September 24, 2003

PyPI: the Python Package Index is the latest attempt to create a comprehensive catalog of third-party Python packages. The catalog is integrated with distutils. This tutorial explains how to use setup.py to create PyPI entries.

There are six simple steps to follow to create a setup script that will work with PyPI. Once the script is written, it requires very little maintenance to update the index on each subsequent release.

  1. Register with PyPI
  2. Collect metadata
  3. Add metadata to setup.py
  4. Check PKG-INFO
  5. Run register command
  6. Check the listing on python.org

Register with PyPI

The first step is to complete the user registration form. When you create or update an entry, you need to provide a username and password. The goal, I presume, is to prevent someone else from modifying your entries.

The PyPI authentication is very weak. The username and plaintext password are passed as part of the form data. Anyone who can guess your username and password can impersonate you. The password is transmitted in cleartext and a simple hash is stored on the server; it's vulnerable to dictionary attacks and simple theft.

You need to save the username and password in a .pypirc file in your home directory. For example:

[server-login]
username:jeremy
password:aaaaaaabb

distutils will read this information when you run the register command.
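If you want to check that the file is well formed before running register, a minimal sketch with the standard configparser module (the modern name of ConfigParser) is enough. The read_login helper and the SAMPLE constant are hypothetical; distutils' own parsing code may differ in its details.

```python
import configparser

# The example .pypirc from above, inlined so the sketch is self-contained.
SAMPLE = """\
[server-login]
username:jeremy
password:aaaaaaabb
"""


def read_login(text):
    """Return (username, password) from the text of a .pypirc file."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    return (parser.get("server-login", "username"),
            parser.get("server-login", "password"))
```

The colon-separated form works because configparser accepts both ":" and "=" as key/value delimiters by default.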

Collect metadata

You need to provide metadata in your setup.py script. When you run python setup.py register, the script will package up the metadata and submit it to python.org. The metadata is described in PEP 241: Metadata for Python Software Packages.

The metadata elements are passed as keyword arguments to the setup() call. Some of the metadata, like name and version, is used to create the file names for distributions. Others, like the Trove classifiers, are only used by PyPI.

Necessary metadata

  Name
    The name of the package
  Version
    A version number like 3.1.4 or 1.0a3
  Summary
    A one-line summary of what the package does. It's like the first line of a doc string.
  Home-page
    The URL of the package's home page
  Author
    The name of the author
  Author-email
    The email address of the author. (PEP 241 mentions that this might be used as a unique key in some catalog of packages, but I don't know if it actually is.)
  License
    PEP 241 says to put the name of the license here, but I don't think that's a good idea. The name doesn't identify a specific license. There are several Zope licenses and several PSF licenses. I recommend putting the URL of the license.
  Description
    A longer description of the package. PEP 241 says this is optional, but it seems too useful to omit. If someone is searching a package index, how else will they know what your package does?
  Platform
    A comma-separated list of supported platforms. It's not clear what exactly this is used for, so for code that should run anywhere you could just say "Any."

You can also include an arbitrary number of Trove classifiers. The classifiers describe the software according to a predefined vocabulary. They answer questions like: "What is the intended audience of the package?" and "What is its development status?"

Note that there is some overlap between the Trove classifiers and the other metadata. The classifiers include entries for license and platforms supported. It seems to me that you ought to provide the information in both places, because different software may only look in one place or the other.
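The PEP 241 field names do not match the setup() keyword names one for one: description becomes Summary, url becomes Home-page, and long_description becomes Description. The sketch below records that mapping and renders a rough approximation of the resulting headers. HEADERS and render_pkg_info are hypothetical names, fields that are not supplied are simply skipped, and the output layout is only an approximation of what distutils writes.

```python
# setup() keyword -> metadata header name, in the order they are rendered here.
HEADERS = [
    ("name", "Name"),
    ("version", "Version"),
    ("description", "Summary"),
    ("url", "Home-page"),
    ("author", "Author"),
    ("author_email", "Author-email"),
    ("license", "License"),
    ("long_description", "Description"),
    ("platforms", "Platform"),
    ("classifiers", "Classifier"),
]


def render_pkg_info(**metadata):
    """Render the supplied setup() keywords as 'Header: value' lines."""
    lines = []
    for keyword, header in HEADERS:
        if keyword not in metadata:
            continue
        value = metadata[keyword]
        # List-valued fields (platforms, classifiers) get one line per item.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            lines.append("%s: %s" % (header, item))
    return "\n".join(lines) + "\n"
```

For example, render_pkg_info(name="ZODB3", version="3.2b3") returns the two lines "Name: ZODB3" and "Version: 3.2b3".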

Add metadata to setup.py

I'll start with a concrete example -- the parts of ZODB's setup.py that relate to metadata.

"""Zope Object Database: object database and persistence

The Zope Object Database provides an object-oriented database for
Python that provides a high degree of transparency. Applications can
take advantage of object database features with few, if any, changes
to application logic.  ZODB includes features such as a pluggable storage
interface, rich transaction support, and undo.
"""

classifiers = """\
Development Status :: 5 - Production/Stable
Intended Audience :: Developers
License :: OSI Approved :: Zope Public License
Programming Language :: Python
Topic :: Database
Topic :: Software Development :: Libraries :: Python Modules
Operating System :: Microsoft :: Windows
Operating System :: Unix
"""

import sys
from distutils.core import setup

# Python versions before 2.3 reject the classifiers keyword, so drop it there.
if sys.version_info < (2, 3):
    _setup = setup
    def setup(**kwargs):
        if "classifiers" in kwargs:
            del kwargs["classifiers"]
        _setup(**kwargs)

doclines = __doc__.split("\n")

setup(name="ZODB3",
      version="3.2b3",
      maintainer="Zope Corporation",
      maintainer_email="zodb-dev@zope.org",
      url = "http://www.zope.org/Wikis/ZODB/FrontPage",
      license = "http://www.zope.org/Resources/ZPL",
      platforms = ["any"],
      description = doclines[0],
      classifiers = filter(None, classifiers.split("\n")),
      long_description = "\n".join(doclines[2:]),
      )

There are only a few interesting things about the specific code. First, Python 2.3 is the earliest Python version that understands the classifiers keyword. If you want the setup script to work with earlier Pythons, you need to add some kind of workaround. (distutils wasn't designed for graceful evolution. It complains about arguments it doesn't understand.)

I create the description and long_description from the script's docstring. It seems convenient to have the information in a regular docstring, because that's what I'm used to doing with other modules.

The classifiers must be passed as a list of strings. I write them in a block as a triple-quoted string and then split them into individual strings in the setup() call. platforms also expects a list of strings.

Check PKG-INFO

You can use the distutils PKG-INFO file to debug the metadata you entered in setup.py. When you create a distribution using setup.py, distutils includes a PKG-INFO file that contains all the package metadata. When you run python setup.py sdist, distutils builds a source tarball and puts it in the dist directory. The tarball contains a PKG-INFO file in the top-level directory.

It's a little inconvenient to extract PKG-INFO from the tarball just to read it, but it is helpful to double-check your metadata before uploading it to python.org.
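One way to take some of the inconvenience out is to read PKG-INFO straight from the tarball with the standard tarfile module. This is a minimal sketch: read_pkg_info and fake_sdist are hypothetical helpers, and fake_sdist exists only so the example can run without a real distribution on disk.

```python
import io
import tarfile


def read_pkg_info(tarball):
    """Return the text of the top-level PKG-INFO in an sdist tarball.

    `tarball` may be a path or an open binary file object.
    """
    if isinstance(tarball, str):
        tar = tarfile.open(tarball)
    else:
        tar = tarfile.open(fileobj=tarball)
    with tar:
        for member in tar:
            # sdist tarballs have a single root directory, e.g. ZODB3-3.2b3/.
            parts = member.name.split("/")
            if len(parts) == 2 and parts[1] == "PKG-INFO":
                return tar.extractfile(member).read().decode("utf-8")
    raise LookupError("no top-level PKG-INFO found")


def fake_sdist(pkg_info_text, root="ZODB3-3.2b3"):
    """Build an in-memory .tar.gz that resembles sdist output (demo only)."""
    payload = pkg_info_text.encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(root + "/PKG-INFO")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf
```

With a real distribution you would pass the path of the tarball that sdist left in the dist directory.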

Run register command

You should have an account, with username and password in .pypirc, and a setup.py script with all the metadata. Now run:

python setup.py register

That's it.

Check the listing on python.org

Go to http://www.python.org/pypi. You should see your package in the list of the last 20 updates. If you log in, you will also see the package in the left navigation bar under the heading "Your Packages."

You can reach the newly generated package record by clicking on the name in the "last 20 updates" table. The individual package pages have a link that says "edit." You can use the edit form to correct any problems you discover after running register.


Thanks to Andrew Kuchling and Richard Jones for comments and corrections.

ZODB: Transaction ids, timestamps, serial numbers

Thursday, September 25, 2003

This entry has detailed notes on transaction ids in ZODB -- probably a boring topic for most readers. Each storage associates a transaction id with a transaction. Operations like undo and history that need to refer to specific transactions use this transaction id. The only constraint on the ids is that they be monotonically increasing.

Each object revision has a serial number. The serial number uniquely identifies a revision of the object. It is possible for different transactions to write data records with the same serial number. For example, an abort version operation will write a new data record with the same serial number as the last non-version data record. (The abort version case may be the only case where serial numbers are re-used.)

In most implementations, the transaction id is used for the serial numbers of each of the object revisions. So if transaction id 12 commits changes to four objects, each object will get the serial number 12. I don't think there is any code that relies on this feature, though. I think it is just a convenient implementation technique.

The transaction ids are implemented using ZODB TimeStamp objects, although I'm not sure if that is part of the contract or just a detail of the implementation that has crept into widespread use. When the dump utilities for a storage print out the transaction headers, they format the transaction id using time.ctime(). For debugging and analyzing failures, it is convenient to read the ids as timestamps.

When ZEO communicates invalidations from server to client, it sends a set of invalidations along with the transaction id that generated them. Currently, the ZEO client stores the transaction id in its cache. When it needs to validate the cache, it requests all the changes since the last transaction id it received invalidations for. (If there are too many, it falls back to validating the serial numbers of every object.)
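The validation logic can be modeled with a toy sketch. None of these names come from ZEO: ToyServer, invalidations_since, revalidate, and the max_invalidations threshold are all hypothetical, and the model assumes the common convention that an object's serial number equals the id of the transaction that wrote it.

```python
class ToyServer:
    """Stand-in for a ZEO server; not the real ZEO API."""

    def __init__(self, max_invalidations=2):
        self.serials = {}   # oid -> serial of the current revision
        self.log = []       # (tid, oid) pairs in commit order
        self.max_invalidations = max_invalidations

    def commit(self, tid, oids):
        for oid in oids:
            self.serials[oid] = tid   # serial == tid, the common convention
            self.log.append((tid, oid))

    def invalidations_since(self, tid):
        changed = {oid for t, oid in self.log if t > tid}
        # Returning None signals "too many; verify the whole cache instead".
        return changed if len(changed) <= self.max_invalidations else None


def revalidate(cache, last_tid, server):
    """Drop stale entries from cache (oid -> serial); return the dropped oids."""
    changed = server.invalidations_since(last_tid)
    if changed is None:
        # Fallback: compare the serial number of every cached object.
        changed = {oid for oid, serial in cache.items()
                   if server.serials.get(oid) != serial}
    stale = changed & set(cache)
    for oid in stale:
        del cache[oid]
    return stale
```

The fast path costs one request proportional to the number of recent changes; the fallback costs one comparison per cached object, which is why it is only used when the change set is too large.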

All of the standard storages use timestamps for transaction ids, relying on the laterThan() method of TimeStamp to guarantee that timestamps are always increasing. The repr of a TimeStamp object is an 8-byte string that is used as the id. (It's very weird to use repr() this way; in ZODB4, we used a method on the TimeStamp instead.) BaseStorage and ClientStorage have basically the same code. The tpc_begin() method takes a transaction id as an optional second argument, although the only place this is used is copyTransactionsFrom(). (Incidentally, the laterThan() call is still used for a supplied id, so there's no guarantee that the specified tid will actually be used.)
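The "always later than the last one" rule is easy to sketch without the real TimeStamp class. The TidGenerator below is hypothetical, and its encoding (microseconds since the epoch packed as a big-endian 64-bit integer) is an assumption chosen for simplicity, not ZODB's actual 8-byte layout. What it shares with the storages is the behavior: ids never go backwards, and a supplied id is bumped if it is not later than the previous one.

```python
import struct
import time


class TidGenerator:
    """Hands out 8-byte ids that always increase (not ZODB's real encoding)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._last = 0

    def new_tid(self, requested=None):
        if requested is not None:
            candidate = requested
        else:
            candidate = int(self._clock() * 1000000)
        # Same idea as laterThan(): never return an id <= the previous one,
        # even if the clock stalls or the caller supplies an older id.
        if candidate <= self._last:
            candidate = self._last + 1
        self._last = candidate
        # Big-endian packing makes byte-string order match numeric order.
        return struct.pack(">Q", candidate)
```

Because the packed ids compare correctly as byte strings, monotonicity can be checked with an ordinary comparison.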

If a single transaction spans multiple storages, each storage could pick a different id. The code uses the current time at seconds granularity as the default, so there's a decent chance that they will end up being the same. But it's also possible for two storages to generate timestamps before and after a clock tick, respectively.

I discussed some of these issues with Jim, and I discovered that he did not have a clear sense of the requirements for transaction ids. There's no written specification that I know of, and most of the implementations use the same techniques. We agreed, in the end, that we would declare that monotonically increasing ids were all that was required.

Lightweight Languages 2003 (LL3)

Friday, September 26, 2003

The 3rd annual Lightweight Languages Workshop will be held at MIT on Saturday, Nov. 8. Presentation proposals are due on Oct. 17.

The first two workshops brought together an engaging mix of academics and industry / open source programmers. We hope to do the same this year.

Online music

Sunday, September 28, 2003

I got a new set of speakers, and I'm listening to music and radio in my office. There is a problem with radio reception in our neighborhood. All of our radios suffer from an odd interference problem. One station comes in almost everywhere on the dial. We get a few other stations, but NPR and almost everything else I would listen to do not come in.

I got speakers for my Windows box at Best Buy. I decided to get the cheapest speakers they had, and came away with a Logitech Z-340 set for $30. My Cambridge Soundworks speakers got hijacked long ago for listening to music in the living room.

I like listening to NPR -- the morning and afternoon news shows, Fresh Air, Car Talk. I prefer to work with some kind of background noise, even when I'm not actively listening. NPR streams its content and has reasonably available servers.

Rhapsody is an online music service offered through my ISP, Speakeasy. It's also made for pleasant listening today. You can define a "radio station" by picking up to 10 artists. It creates a playlist based on those artists, but not limited to just them. You don't get to pick the songs or the order, but you can skip any song.

I've got a punk / alternative station and a jazz station that have both come up with some pleasant surprises. The punk station turned up a bunch of tracks from hard-to-find SST Records albums and several Mission of Burma tracks; none of them were on the list of 10 artists. I put Tommy Flanagan on my jazz playlist, and two of his tracks that I don't own came up in the first hour.

This is all a big improvement over the one radio station we do get: 99.9 The Hawk, a tiresome classic rock station that plays the Rolling Stones, The Who, and Led Zeppelin and then goes back to the Stones. (I like them all in moderation.) When I was in junior high school, the station was known as Q100 and was the "cool" Top 40 station. It seems I'll never escape it.

Boston.com powered by Zope

Monday, September 29, 2003

Boston.com launched its new site a few weeks ago, and announced it in a press release today. The bottom of the press release mentions Zope:

Simultaneous with the Boston.com redesign was the implementation of a new content management system built using the Zope4Media open-source solution from Zope Corporation. The open-source platform will enable Boston.com's producers to integrate static and dynamic content from many different sources, test interactive tools, and roll out new section designs.

There is not a lot of information about Zope4Media on Zope Corp's web site. (Actually, there is information, but it's trapped in a PDF file. The PDF of Zope4Media has details.)

There are print and broadcast versions of Z4M. The print version is used by Boston.com. The broadcast version is used by Viacom local television and radio stations.