[IPython-dev] DAG Dependencies

Satrajit Ghosh satra at mit.edu
Thu Oct 28 14:30:02 EDT 2010


hi brian,

thanks for the responses. i'll touch on a few of them.

> > * optionally offload the dag directly to the underlying scheduler if it
> > has dependency support (i.e., SGE, Torque/PBS, LSF)
>
> While we could support this, I actually think it would be a step
> backwards.
>
...
>
> All of this means lots and lots of latency for each task in the DAG.
> For tasks that have lots of data or lots of Python modules to import,
> that will simply kill the parallel speedup you will get (ala Amdahl's
> law).
>

here is the scenario where this becomes useful (and hence why i'd like to
have it as an option). let's say you have started 10 clients/ipengines under
sge. at the time the engines were created, one machine with 10 free slots
happened to be available, so sge routed all 10 engines to that machine. that
single machine will now be used for all ipcluster processing. whereas if
node distribution and ipengine startup were to happen at the level of the
sge scheduler itself, each process would get routed to the best available
slot at the time of execution.

i agree that in several other scenarios the current mechanism works great,
but this is a common situation that we run into on a heavily used cluster
(limited nodes + lots of users).
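
to make this concrete, here is a rough sketch (purely illustrative, not
nipype or ipython code; the run_*.py scripts are made up) of what handing
the dependencies to sge itself could look like: each task becomes its own
qsub submission, and -hold_jid tells sge what it has to wait for, so every
task gets placed on whatever slot is free at the moment it becomes runnable.

# illustrative only: submit each DAG task as a separate SGE job and let
# -hold_jid express the dependencies, so SGE places each task when it runs.
import subprocess

def submit(cmd, deps=()):
    """submit one task; `cmd` is an argv list, `deps` are SGE job ids to wait for."""
    args = ["qsub", "-terse", "-b", "y"]   # -terse: print only the job id
    if deps:
        args += ["-hold_jid", ",".join(deps)]
    args += cmd
    return subprocess.check_output(args).decode().strip()

# example DAG: A -> B -> E and A -> C -> D
a = submit(["python", "run_A.py"])
b = submit(["python", "run_B.py"], deps=[a])
e = submit(["python", "run_E.py"], deps=[b])
c = submit(["python", "run_C.py"], deps=[a])
d = submit(["python", "run_D.py"], deps=[c])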


> > * something we currently do in nipype is that we provide a configurable
> > option to continue processing if a given node fails. we simply remove the
> > dependencies of the node from further execution and generate a report at
> > the end saying which nodes crashed.
>
> I guess I don't see how it was a true dependency then.  Is this like
> an optional dependency?  What are the usage cases for this.
>

perhaps i misunderstood what happens in the current implementation. if you
have a DAG with edges (A,B) (B,E) (A,C) (C,D) and C fails, does the current
dag controller continue executing B and E, or does it abort at the first
failure? in nipype we have the option to go either way: if something
crashes, either stop everything, or keep processing everything that does not
depend on the crashed node.
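
to illustrate, here is a toy sketch (made-up code, not what nipype actually
does) of the "keep going" behaviour: when a node fails, everything
downstream of it is dropped and the rest of the graph still runs.

# DAG with edges (A,B) (B,E) (A,C) (C,D); if C crashes, only D is skipped
edges = [("A", "B"), ("B", "E"), ("A", "C"), ("C", "D")]
children = {}
for parent, child in edges:
    children.setdefault(parent, []).append(child)

def downstream(node):
    """all direct and indirect dependents of `node`"""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

failed = "C"
skipped = downstream(failed)
still_runnable = [n for n in "ABCDE" if n != failed and n not in skipped]
print("skipped:", sorted(skipped))          # ['D']
print("still runnable:", still_runnable)    # ['A', 'B', 'E']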


> > * callback support for node: node_started_cb, node_finished_cb
>
> I am not sure we could support this, because once you create the DAG
> and send it to the scheduler, the tasks are out of your local Python
> session.  IOW, there is really no place to call such callbacks.
>

i'll have to think about this one a little more. one use case for this is
reporting where things stand within the execution graph (perhaps the
scheduler can report this, although then i'm back to polling instead of
being called back).
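
roughly, what i mean by "back to polling" is a loop like the one below
(get_status is a hypothetical stand-in for whatever status the scheduler can
report): the client periodically asks which nodes have started or finished
and fires the callbacks locally.

# illustrative polling loop; get_status and the callbacks are hypothetical
import time

def poll(node_ids, get_status, node_started_cb, node_finished_cb, interval=5.0):
    """get_status(node_id) should return 'pending', 'running', or 'done'."""
    started, finished = set(), set()
    while len(finished) < len(node_ids):
        for nid in node_ids:
            state = get_status(nid)
            if state in ("running", "done") and nid not in started:
                started.add(nid)
                node_started_cb(nid)
            if state == "done" and nid not in finished:
                finished.add(nid)
                node_finished_cb(nid)
        time.sleep(interval)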


> > * support for nodes themselves being DAGs
>
...
>
> I think for the node is a DAG case, we would just flatten that at
> submission time.  IOW, apply the transformation:
>
> A DAG of nodes, each of which may be a DAG => A DAG of nodes.
>
> Would this work?
>

this would work. i think we have a slightly more complicated case of this
implemented in nipype, but perhaps i need to think about it again. our case
is like a maptask, where the same node operates on a list of inputs and the
outputs are then collated back together. but as a general-purpose mechanism,
you should not worry about this use case for now.
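
for what it's worth, a rough sketch of the flattening you describe might
look like the function below (all names are made up for illustration; this
is not nipype's mapnode code): edges into or out of a composite node are
rewired to its entry/exit nodes, and the sub-DAG's internal edges are
spliced into the parent graph.

def flatten(edges, subdags):
    """edges: list of (src, dst); subdags: {node: (sub_edges, entry, exit)}"""
    flat = []
    for src, dst in edges:
        if src in subdags:
            src = subdags[src][2]   # leave the sub-DAG through its exit node
        if dst in subdags:
            dst = subdags[dst][1]   # enter the sub-DAG through its entry node
        flat.append((src, dst))
    for sub_edges, _entry, _exit in subdags.values():
        flat.extend(sub_edges)      # splice in the sub-DAG's internal edges
    return flat

# node C is itself a small DAG c1 -> c2
outer = [("A", "C"), ("C", "D")]
inner = {"C": ([("c1", "c2")], "c1", "c2")}
print(flatten(outer, inner))
# [('A', 'c1'), ('c2', 'D'), ('c1', 'c2')]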


> Yes, it does make sense to support DRMAA in ipcluster.  Once Min's
> stuff has been merged into master, we will begin to get it working
> with the batch systems again.
>

great.

cheers,

satra