[SciPy-Dev] Proposed addition to scipy.sparse.csgraph.maximum_flow

Søren Fuglede Jørgensen scipy-dev at fuglede.dk
Sat Jan 18 01:49:46 EST 2020


Hi mhwende,

There are definitely some interesting observations in here! I'll let someone more familiar with SciPy decide what the bar for acceptable complexity should be here, but the DFS change certainly seems minimal enough not to be harmful.

Regarding the changes to the initialization functions called prior to the actual optimization: You mention that your inputs are non-sparse. When originally working on the benchmarks in https://github.com/scipy/scipy/pull/10566 I tried a few different sparsities. I seem to recall that I found the NumPy route preferable for all things initialization, but I wonder if that simply has to do with the sparsity of the inputs I used during testing. In particular, I would have expected your use of `transpose` to be much slower but your numbers certainly suggest otherwise. Would you be able to repeat your experiment with sparse inputs to see if the improvements are consistent?
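
For concreteness, the kind of rerun I have in mind would look roughly like the sketch below. It is only a sketch: the densities, the capacity range and the choice of source and sink are arbitrary on my part, not anything taken from your setup.

import time
import numpy as np
import scipy.sparse as sp
import scipy.sparse.csgraph


def bench_sparse(n=2000, density=0.001, seed=0):
    rng = np.random.RandomState(seed)
    # Random sparse capacity matrix; replace the float data with small
    # integer capacities, since maximum_flow requires an integer dtype.
    a = sp.random(n, n, density=density, format="csr", random_state=rng)
    a.data = rng.randint(1, 11, size=a.nnz).astype(np.int32)
    t0 = time.time()
    res = sp.csgraph.maximum_flow(a, 0, n - 1)
    print("density={:g}: flow={} in {:.3f} s".format(
        density, res.flow_value, time.time() - t0))


for d in (0.0005, 0.005, 0.05):
    bench_sparse(density=d)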

Not to turn this into a code review already, but I did have a few thoughts while reading through:

- I didn't check for correctness, but while the new initialization code is a bit longer, it is also much more explicit, which could be a plus for maintainability.

- Regarding the naming, having DFS as part of the signature for _edmonds_karp irked me slightly, since -- at least in the literature I know of -- Edmonds--Karp is simply Ford--Fulkerson with the further specification that augmenting paths be found using BFS. So the new method is really more like "Ford--Fulkerson with DFS".

- Along the same lines, I would suggest allowing for a more general method description in `maximum_flow` itself, rather than just a boolean toggling the search order (see the sketch after this list): when it comes to performance improvements over the existing method, other more or less low-hanging fruit would be to include some of the other solvers for the same problem, push-relabel being a natural first candidate. Indeed, doing so might lead to improvements even greater than the ones your tweak gave rise to (of course at the cost of complexity). As such, it might be preferable to leave that option open by having something like the "method" parameter from scipy.optimize, and using that instead of the simpler bool.

- Even if it seems to be the natural choice for your inputs, I'm in favor of keeping BFS the default, as you also did. DFS has exponential worst case complexity [0] while Edmonds--Karp is polynomial time.
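
To make the method-parameter point above a bit more concrete, the kind of interface I have in mind is sketched below. This is only an illustration, not a settled proposal: the wrapper name maximum_flow_with_method and the method strings are placeholders, and only the BFS path dispatches to code that actually exists today.

import scipy.sparse as sp
import scipy.sparse.csgraph


def maximum_flow_with_method(csgraph, source, sink, method="edmonds_karp"):
    # Dispatch on a string, mirroring the `method` parameter of scipy.optimize,
    # so that further solvers can be added later without changing the signature.
    if method == "edmonds_karp":
        # Current behaviour: Ford--Fulkerson with BFS augmenting paths.
        return sp.csgraph.maximum_flow(csgraph, source, sink)
    if method == "ford_fulkerson_dfs":
        # The DFS variant from the proposed patch would be dispatched here.
        raise NotImplementedError("DFS variant is not part of SciPy")
    if method == "push_relabel":
        # Room for additional max flow solvers.
        raise NotImplementedError("push-relabel is not part of SciPy")
    raise ValueError("Unknown method: {!r}".format(method))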

Regards,
Søren


[0]: https://people.cs.clemson.edu/~bcdean/iterm.pdf

On Thu, Jan 02, 2020 at 02:25:35PM +0000, mhwende wrote:
> Hi,
> 
> recently I have made use of the maximum flow algorithm for
> CSR matrices, i.e., the scipy.sparse.csgraph.maximum_flow function.
> 
> My test case was a non-sparse, bipartite network with lots of connections
> between the two node sets. The value of the maximum flow was small in
> comparison to the total number of nodes and edges in the graph.
> I wanted to use the scipy.sparse.csgraph module, but with a depth-first
> search (DFS) instead of the breadth-first search (BFS) that you have
> implemented. Therefore, I have added an option to use DFS instead of BFS
> within your implementation. When measuring the execution time for a random
> graph, there was a significant speedup of the DFS procedure compared to the
> BFS variant.
> 
> During these tests, I also refactored the two other functions that you call
> within the main csgraph.maximum_flow function, i.e., _add_reverse_edges and
> _make_edge_pointers. Depending on the input graph, I found that the time
> spent in these functions can be larger than the execution time for the actual
> Edmonds-Karp algorithm.
> 
> Below, you can find a table showing the measured execution times in seconds.
> Within the columns of this table, the execution time is broken down into parts
> as follows:
> 
> a) Adding reverse edges (function _add_reverse_edges).
> b) Making edge pointers (function _make_edge_pointers).
> c) Actual max flow computation (function _edmonds_karp).
> d) Total execution time (function maximum_flow).
> 
> Rows of the table refer to the changes which were made to the implementation,
> where each measurement was performed with either DFS or BFS for the
> construction of augmenting paths:
> 
> 1) Original implementation.
> 2) After the reimplementation of _make_edge_pointers.
> 3) After the reimplementation of _add_reverse_edges.
> 
> Here are the execution times:
> 
>         |     a)    |     b)    |     c)    |     d)
> --------+-----------+-----------+-----------+-----------
> 1. BFS  |   1.29 s  |   3.81 s  |   8.93 s  |  14.08 s
> 1. DFS  |   1.29 s  |   3.82 s  |   0.50 s  |   5.66 s
> --------+-----------+-----------+-----------+-----------
> 2. BFS  |   1.54 s  |   0.22 s  |   8.94 s  |  10.76 s
> 2. DFS  |   1.33 s  |   0.22 s  |   0.50 s  |   2.10 s
> --------+-----------+-----------+-----------+-----------
> 3. BFS  |   0.16 s  |   0.27 s  |   9.01 s  |   9.50 s
> 3. DFS  |   0.16 s  |   0.23 s  |   0.52 s  |   0.96 s
> 
> By comparing the total execution times for BFS in 1d) and for
> DFS in 3d), my code changes have led to a speedup factor of about 14
> for the network example which I have mentioned.
> 
> Please let me know if you would like to incorporate some or all of my
> changes into your code base.
> My changes (a single commit) can be found here:
> https://github.com/mhwende/scipy/commit/e7be2dc29364fe95f6fae20a4c0775716cdba5e5
> 
> Finally, here is the script which I have used for testing.
> The above timings have been produced with the command:
> 
> python scipymaxflowtest.py 2000 2000 --dfs=0
> 
> This will create a graph with 4,002 nodes and 4,004,000 edges,
> with randomly chosen zero or unit capacities.
> 
> # File scipymaxflowtest.py
> 
> import time
> import argparse
> import numpy as np
> import scipy.sparse as sp
> import scipy.sparse.csgraph  # makes sp.csgraph available below
> 
> 
> def scipymaxflowtest(layer_sizes, dfs):
>   layer_sizes = np.array(layer_sizes, dtype = np.int64)
>   num_layers = len(layer_sizes)
>   print("Layer sizes:", *layer_sizes)
>   print("Number of layers:", num_layers)
>   num_nodes = 2 + np.sum(layer_sizes)
>   print("Number of nodes:", num_nodes)
>   num_edges = layer_sizes[0] + layer_sizes[-1]
>   for i in range(layer_sizes.size - 1):
>     num_edges += layer_sizes[i] * layer_sizes[i + 1]
>   print("Number of edges:", num_edges)
>   #data = np.ones(num_edges, dtype = np.int64)
>   data = np.random.randint(0, 2, num_edges)
>   indices = np.zeros(num_edges, dtype = np.int64)
>   indptr = np.zeros(num_nodes + 1, dtype = np.int64)
>   ptr = 0
>   for j in range(layer_sizes[0]):
>     indices[ptr] = 2 + j
>     ptr += 1
>   indptr[1] = ptr
>   indptr[2] = ptr
>   layer_sum = 0
>   for k in range(num_layers - 1):
>     next_layer_sum = layer_sum + layer_sizes[k]
>     for i in range(layer_sizes[k]):
>       for j in range(layer_sizes[k + 1]):
>         indices[ptr] = 2 + next_layer_sum + j
>         ptr += 1
>       indptr[2 + layer_sum + i + 1] = ptr
>     layer_sum = next_layer_sum
>   for i in range(layer_sizes[-1]):
>     indices[ptr] = 1
>     ptr += 1
>     indptr[2 + layer_sum + i + 1] = ptr
>   indptr[num_nodes] = ptr
> 
>   # Check that we set as many entries in the index array as there are edges.
>   assert ptr == num_edges
> 
>   # Create CSR matrix from data and index arrays.
>   a = sp.csr_matrix((data, indices, indptr), shape = (num_nodes, num_nodes))
>   if num_nodes <= 12:
>     print("A =\n{}".format(a.toarray()))
> 
>   t0 = time.time()
>   flow_result = sp.csgraph.maximum_flow(a, 0, 1, dfs)
>   t1 = time.time()
>   flow_value = flow_result.flow_value
>   print("Maximum flow: {} (took {:.6f} s)".format(flow_value, t1 - t0))
>   return flow_value
> 
> 
> if __name__ == "__main__":
>   parser = argparse.ArgumentParser()
>   parser.add_argument(
>       "layer_sizes", type = int, nargs = "+",
>       help = "Number of nodes on individual graph layers.")
>   parser.add_argument(
>       "--dfs", type = int, default = 0,
>       help = "Whether to use depth-first search.")
>   args = parser.parse_args()
>   np.random.seed(4891243)
>   scipymaxflowtest(args.layer_sizes, args.dfs)
> 

