[issue17005] Add a topological sort algorithm

Tue Jan 14 21:04:00 EST 2020

Tim Peters <tim at python.org> added the comment:

> I am slightly confused about what .prepare() should do. Why
> is this step necessary?

To say "I'm done adding edges".  Any call to add() after prepare() should raise an exception.  Likewise, e.g., any call to get_ready() before prepare() should raise an exception.  In a bog-standard topsort implementation, saving for each node a sequence of successors and a count of predecessors, this is also the time to collect all the nodes that have no predecessors (the first batch returned by get_ready()).

Much the same could be done without prepare() by get_ready() making a special case out of the first time it's called.  That's more magical, though.  "I'm done adding edges" is utterly non-magical.

> - Why we need the .done() method here? Why not instead make get_ready()
> simply a generator so you can just write
>
>    for node in self.get_ready():

The point of done() is to enable maximum exploitation of parallelism.  As already sketched, if a user doesn't care about that, fine, a different method (like static_order()) can generate all the nodes in _some_ static topsort order, with no need for done().

But suppose a user does care about parallelism.  Consider graph

A -> B
A -> C
A -> D
B -> D

Output A B C D is a topsort, but useless unless the user is content to "do" one node at a time.

Instead get_ready() first returns [A] (or a tuple, or a generator, or a set ... something iterable).  A is handed out to worker processes/threads, but get_ready() will return an empty iterable until done(A) is called.  Indeed, if "doing" A fails, it's NOT the case that anything else can ever be started.

If/when "doing A" succeeds, then done(A) is called, and the next get_ready() returns [B, C].  Those can be done in parallel, but D can't be started until done(B) is called.  done(B) may or may not be called before done(C) is called - the topsort itself has no way to know in advance, nor _should_ it impose its own view of that.  Note that D does not depend on C, so there's no need to wait for _both_ in [B, C] to finish.  It's necessary and sufficient that B be marked done() for D to be ready to go.

> It seems that the .done() is very tight to use this API as a "task
> scheduler" but maybe I am doing something here in my understanding
> of the API.

done() says nothing about how the user "should" schedule work items, but instead allows get_ready() to return all the work items whose predecessors have been marked done() (but haven't already been passed out by get_ready()).  That's the maximum set of nodes that _can_ be worked on at the time.  The topsort class itself has no policy about how or when they "should" be worked on, get_ready() is just identifying all the possibilities that exist.  Which is impossible to know unless the class is also told which nodes it already passed out have finished - the purpose of done().

is_active() eventually returns False when all the nodes passed out by get_ready() have been marked done(), _and_ there are no more nodes ready to pass out.  At that point, there's a cycle in the input relations if and only if there's some node get_ready() never passed out.

In my prototype implementation, that's another thing prepare() does:  checks for a cycle, and raises CycleError if there is one.  The user can catch & ignore that if they like, and continue calling get_ready() and done() until no more progress can be made.  I think it's more likely, though, that the user would stare at the cycle attached to the CycleError instance, do whatever it takes to break the cycle, and start over again.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue17005>
_______________________________________