Response to Guido's essay on graph representation

Thu Apr 4 23:31:17 EST 2002

For me David's code emphasizes the importance of well designed data
structures!  In particular his implementation of a priority queue (his
priorityDictionary class).  I love it, and here's why...

<long rant on optimization in Python>

I've recently been fooling around with various graph algorithms, and had
implemented a variety of "naive" implementations of classics like Prim's and
Kruskal's MST algorithms, Djikstra's shortest path, Floyd's all shortest
paths, etc.

My implimentations worked ok, but for even moderately large data sets they
were SLOW.  I figured that I was just bumping up against Python's limits,
and so decided to reimpliment the Kruskal and Prim algorithms using
structures from C extension modules.

I first reimplemented the Kruskal algorithm using the kjGraph data structure
from Aaron Watters module kjbuckets, which resulted in a considerable speed
up, but was still somewhat pokey.  After going back and reading Cormen et
al.'s Introduction to Algorithms (highly recommended!) I realized that  part
of my problem was that I wasn't using a disjoint-set data structure.  I
coded up the disjoint-set representation in pure Python and lo and behold,
the Python implementation using the "right" data structure was several times
faster than the C extension module using the "wrong" data structure!
Ka-ching!

With my renewed appreciation for the importance of data structures in hand I
turned to Prim's algorithm.  The standard approach for this algorithm is to
use a priority queue, however my initial naive implementation had been
abysmally slow.  Again I turned to somebody elses C extension for priority
queues which used Fibonacci heaps.  My implementation of Prim's based on
this was pretty darn fast, and decided I'd found the proper solution for my
needs, despite the fact that I had trouble making heads or tails out of the
fibonacci heap C code (I hate taking a black-box approach!).

Today I stumbled across David's implementation of priority queues.  I
decided to give it a shot as the underlying data structure for my
implementation of Prim's.  Dropping it into my implementation was trivial
(only a few lines of code needed to be changed) and I threw some test data
sets at it.  How did it do?  Amazingly well!

Check out these stats from runs with the profiler module for a 1000 vertex,
(1000 choose 2) edge graph.  Remember, that using the profiler adds a good
chunk of time to the runs but is good for getting a sense of relative
speed...

First the Fib. heap based C extension module:

>>> profile.run('mst.mst_prim(Avert, Adist)')
         3003 function calls in 15.172 CPU seconds
<< details snipped >>

Now, the priorityDictionary results:

>>> profile.run('pd.mst_prim(Avert, Adist)')
         62658 function calls in 20.383 CPU seconds
[ details snipped, but about 3.5 seconds is just calls to __setitem__ as I
filled up the queue ]

Obviously the C-based data structure is still faster, but I can understand
what David's implementation is doing, while the C implementation would take
quite some slogging through to understand.  That for me is worth the
tradeoff!

</long rant>

Of course, anecdotes about Python optimization are not uncommon - either in
this newsgroup or on the web (see Guido's classic here:
http://www.python.org/doc/essays/list2str.html ).  However, running into
real world examples in my own day-to-day work puts a real smile on my face.

Go Python!

Ciao,
Paul