Jeremy Hylton : weblog : 2004-01-22

Simple Python Aggregator

Thursday, January 22, 2004

I've been working on a very simple Python RSS aggregator as a way to learn more about RSS and to see what kinds of data turn up in real-world feeds. I decided to package up the code and release it -- all 500 lines of it. It's sagg 0.1, a simple aggregator.

The simple aggregator (sagg) fetches a collection of RSS feeds and produces a single HTML page containing all the entries sorted by time. It's a small package -- fewer than 500 lines of code -- intended to demonstrate some basic techniques.
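To make the "entries sorted by time" step concrete, here is a minimal sketch of merging entries and rendering a page. It is not sagg's actual code: the entry dictionaries, their `title`/`link`/`date` keys, and the function names `entry_time` and `render_page` are assumptions made for illustration, and fetching the feeds is left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from html import escape


def entry_time(entry):
    """Return the entry's RFC 822 date as an aware datetime.

    Entries with a missing or unparseable date get the earliest
    possible time, so they sort last in a newest-first listing.
    """
    try:
        dt = parsedate_to_datetime(entry.get("date"))
    except (TypeError, ValueError):
        return datetime.min.replace(tzinfo=timezone.utc)
    if dt.tzinfo is None:
        # Treat dates without a zone as UTC (an assumption).
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def render_page(entries):
    """Render entries from any number of feeds as one HTML list, newest first."""
    ordered = sorted(entries, key=entry_time, reverse=True)
    rows = []
    for e in ordered:
        rows.append('<li><a href="%s">%s</a> (%s)</li>' % (
            escape(e["link"], quote=True),
            escape(e["title"]),
            escape(e.get("date") or "undated"),
        ))
    return "<html><body><ul>\n%s\n</ul></body></html>" % "\n".join(rows)


# Hypothetical entries, as if already pulled from two different feeds.
entries = [
    {"title": "Older", "link": "http://example.com/a",
     "date": "Wed, 21 Jan 2004 09:00:00 GMT"},
    {"title": "Newer", "link": "http://example.com/b",
     "date": "Thu, 22 Jan 2004 10:00:00 GMT"},
    {"title": "No date", "link": "http://example.com/c"},
]
page = render_page(entries)
```

Sorting on a parsed datetime rather than on the raw date string matters because feeds report times in different zones and formats.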

I use Fredrik Lundh's ElementTree package so that I don't have to deal with any low-level XML issues. ElementTree, in turn, uses Python's built-in Expat-based XML parser. ElementTree has the cleanest API of any of the Python XML bindings I've seen.
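As a rough illustration of why the ElementTree API is pleasant for this job, the sketch below pulls items out of a small RSS 2.0 document. It uses `xml.etree.ElementTree`, the version of ElementTree that later shipped in Python's standard library, rather than the standalone package available in 2004; the sample feed and the `parse_items` helper are made up for the example and are not taken from sagg.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <pubDate>Thu, 22 Jan 2004 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, tolerating missing child elements."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # findtext returns None when the element is absent.
            "pubDate": item.findtext("pubDate"),
        })
    return items


items = parse_items(SAMPLE_RSS)
```

This only handles un-namespaced RSS 2.0 elements; RSS 1.0 and Atom put their elements in XML namespaces and would need qualified tag names.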

Sagg is very similar to Spycyroll, formerly PyBlagg, started by Vattekkat Satheesh Babu. The chief difference is that Spycyroll uses Mark Pilgrim's Ultra-liberal feed parser and Sagg uses its own ElementTree-based parser. Sagg's parser handles fewer feeds, but it's much less code. Pilgrim's parser is about three times the size of Sagg itself.

I tested the ultra-liberal parser against my ET parser. The results weren't as bad as I feared. I collected 2782 feeds from syndic8. The Expat-based parser choked on 276 of them, about 10 percent.
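A count like this can be reproduced along the following lines. This is only a sketch of the idea, assuming that "choked" means the strict parser raised a well-formedness error; the `count_failures` helper and the three sample documents are invented for the example, and it uses the standard-library `xml.etree.ElementTree` rather than the original package.

```python
import xml.etree.ElementTree as ET


def count_failures(documents):
    """Count the documents that the Expat-based parser rejects."""
    failed = 0
    for text in documents:
        try:
            ET.fromstring(text)
        except ET.ParseError:
            failed += 1
    return failed


# Made-up stand-ins for downloaded feeds.
docs = [
    "<rss><channel/></rss>",         # well-formed
    "<rss><channel></rss>",          # mismatched tag
    "<rss>AT&T</rss>",               # unescaped ampersand
]
failures = count_failures(docs)      # 2 of the 3 samples fail

# The failure rate reported above: 276 of 2782 feeds.
rate = round(100 * 276 / 2782, 1)    # 9.9 percent
```

A strict XML parser rejects the whole document on the first error, which is why a single stray ampersand is enough to lose a feed.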