[SciPy-Dev] Graph shortest path

Gael Varoquaux gael.varoquaux at normalesup.org
Mon Dec 19 19:10:47 EST 2011


On Mon, Dec 19, 2011 at 10:03:29AM -0600, Warren Weckesser wrote:
>    Networkx ([3]http://networkx.lanl.gov/) already provides an assortment of
>    data structures and algorithms for graphs.  It is an active,
>    well-documented project.  Personally, I'd rather not start adding code to
>    scipy that duplicates it.   If your code is better/faster/stronger than
>    theirs, why not contribute it to networkx?

I think that there is room in the scipy world for a lower-level, more
efficient base structure for graphs, and that scipy is well suited to
host it. Simple graph algorithms are common to many applications; for
instance, a lot of sparse linear algebra relies on basic graph
algorithms. It seems to me that having sparse graph algorithms in scipy
(we already have some under the sparse package) would help projects in
the scipy world share code for graph algorithms. For instance,
scikit-image, scikit-learn, and pyamg all have shortest path algorithms.

NetworkX is not well suited for purely numerical applications. First of
all, it is pure Python, which carries a lot of overhead for the many
graph algorithms that cannot be expressed as array or matrix
computations. Second, it uses a dict-of-dicts representation to store
and manipulate graphs. The reason for this representation is that it
makes it possible to store rich objects on the nodes, and to dynamically
add and remove nodes. Such choices are great for the complex-systems
types of problems that NetworkX targets. However, for number-crunching
applications they are not, as the representation has a significant
memory overhead and makes it hard to call into compiled code.
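To make the contrast concrete, here is a small sketch (mine, not from
the benchmarks below) of the same tiny weighted graph in the two
representations: a dict-of-dicts in the style NetworkX uses internally,
and a scipy.sparse CSR matrix, which stores the whole graph in three
flat arrays that compiled code can walk directly:

```python
import numpy as np
from scipy import sparse

# dict-of-dicts, in the style of NetworkX adjacency storage:
# one Python dict per node, plus boxed float objects for weights.
dod = {0: {1: {'weight': 2.0}},
       1: {0: {'weight': 2.0}, 2: {'weight': 1.0}},
       2: {1: {'weight': 1.0}}}

# CSR: the same graph as three flat C arrays.
row = [0, 1, 1, 2]
col = [1, 0, 2, 1]
dat = [2.0, 2.0, 1.0, 1.0]
csr = sparse.csr_matrix((dat, (row, col)), shape=(3, 3))

print(csr.indptr)   # per-node slice boundaries into indices/data
print(csr.indices)  # neighbor indices, contiguous per node
print(csr.data)     # edge weights, contiguous per node
```

The neighbors of node u are `indices[indptr[u]:indptr[u+1]]`, so a
Cython or C loop over a node's edges is just pointer arithmetic, with no
Python objects involved.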

To put a few numbers on the computation cost and the memory overhead, I
ran some small benchmarks computing the shortest paths of a graph,
something we need in scikit-learn. I compared the running time of the
algorithms, but I also benchmarked the time needed to convert the data
structures:

    In [1]: from sklearn.utils import graph_shortest_path as skl_graph

    In [2]: import networkx as nx

 * For medium-sized graphs (negligible memory cost):

    In [3]: G = nx.gnm_random_graph(100, 100, seed=0) 

    In [4]: mat = nx.to_scipy_sparse_matrix(G).asformat('lil')

    In [5]: %timeit skl_graph.graph_shortest_path(mat, directed=False)
    100 loops, best of 3: 3.61 ms per loop

    In [6]: %timeit nx.from_scipy_sparse_matrix(mat)
    1000 loops, best of 3: 1.02 ms per loop

    In [7]: %timeit nx.shortest_paths.all_pairs_dijkstra_path_length(G)
    10 loops, best of 3: 50.3 ms per loop

 * For biggish graphs (nx uses 800 MB, scikit-learn around 10 to 20 MB):

    In [8]: G = nx.gnm_random_graph(5000, 5000, seed=0)

    In [9]: mat = nx.to_scipy_sparse_matrix(G).asformat('lil')

    In [10]: %timeit skl_graph.graph_shortest_path(mat, directed=False)
    1 loops, best of 3: 15.8 s per loop

    In [11]: %timeit nx.from_scipy_sparse_matrix(mat)
    10 loops, best of 3: 59.6 ms per loop

    In [12]: %timeit nx.shortest_paths.all_pairs_dijkstra_path_length(G)
    1 loops, best of 3: 197 s per loop

With 10000 nodes, the above code does not run on my laptop, as it uses
more than the available 2 GB of RAM.
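For the record, the kind of algorithm I have in mind needs nothing more
than the three CSR arrays. A minimal pure-Python sketch of single-source
Dijkstra over a scipy.sparse matrix (my own illustration, not the
scikit-learn implementation, which does this loop in Cython):

```python
import heapq
import numpy as np
from scipy import sparse

def dijkstra_csr(csr, source):
    """Single-source shortest path lengths over a CSR adjacency matrix.

    Pure-Python sketch: a real implementation would run this same loop
    in Cython/C over the indptr/indices/data arrays.
    """
    n = csr.shape[0]
    dist = np.full(n, np.inf)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for k in range(csr.indptr[u], csr.indptr[u + 1]):
            v = csr.indices[k]
            nd = d + csr.data[k]
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Tiny symmetric example: the cheapest route 0 -> 2 goes via node 1.
g = sparse.csr_matrix(np.array([[0, 2, 9],
                                [2, 0, 1],
                                [9, 1, 0]], dtype=float))
print(dijkstra_csr(g, 0))  # [0. 2. 3.]
```

All-pairs shortest paths is then just this routine called once per node;
the point is that the data structure imposes neither Python objects per
edge nor any conversion step.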

In addition, basic graph algorithms on numerical structures are needed
in a lot of applications. Projects that need efficient implementations,
like scikit-learn, scikit-image, or pyamg, end up reimplementing them.
It seems profitable to have scipy as a common resource for sharing them;
scipy.sparse has all the necessary tools for that.

Don't get me wrong, I love NetworkX and use it often because it has so
many algorithms and is so versatile. I just think that a few graph
algorithms are central and deserve special treatment.

Cheers,

Gael
