Pickle based workflow - looking for advice

Mon Apr 13 13:08:50 EDT 2015

Fabien wrote:

> I am writing a fairly extensive piece of scientific software. Its
> workflow is easy to explain. The tool performs a series of
> operations on watersheds (such as mapping data onto them, geostatistics,
> and more). There are thousands of independent watersheds of different
> sizes, and the size determines the computing time spent on each of them.
> 
> Say I have the operations A, B, C and D. B and C are completely
> independent of each other, but they need A to be run first, D needs B
> and C, and so forth. Eventually operations A, B, C and D will each run
> once and for all, but of course development is an iterative process and
> I rerun all operations many times.

> 4. Other comments you might have?

How about a file-based workflow?

Write distinct scripts, e.g.

a2b.py that reads from *.a and writes to *.b

and so on. Then use a plain old makefile to define the dependencies.
Whether .a uses pickle, .b uses json, and .z uses csv is but an 
implementation detail that only its producers and consumers need to know. 
Testing an arbitrary step is as easy as invoking the respective script with 
some prefabricated input and checking the resulting output file(s).
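
To make that concrete, here is a minimal sketch of what such a script could look like. Everything in it is an assumption for illustration: that *.a is a pickled dict with "name" and "values" keys, that *.b is JSON, and that transform() stands in for whatever operation B really does.

```python
"""a2b.py: hypothetical step that reads a pickled *.a file and writes a JSON *.b file."""
import json
import pickle
import sys


def transform(data):
    # Placeholder for the real operation B (assumed, not from the original
    # post); it only counts the values so the example computes something.
    return {"name": data["name"], "n_values": len(data["values"])}


def convert(src, dst):
    # Read the pickled input produced by the previous step.
    with open(src, "rb") as f:
        data = pickle.load(f)
    # Write the result as JSON for whichever step consumes *.b.
    with open(dst, "w") as f:
        json.dump(transform(data), f)


# Intended invocation: python a2b.py watershed1.a watershed1.b
if __name__ == "__main__" and len(sys.argv) == 3:
    convert(sys.argv[1], sys.argv[2])
```

With GNU make, a pattern rule along these lines (recipe line indented with a tab) would then rebuild a .b file only when its .a file or the script itself has changed:

    %.b: %.a a2b.py
    	python a2b.py $< $@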