Pickle based workflow - looking for advice

Mon Apr 13 12:25:38 EDT 2015

On 04/13/2015 10:58 AM, Fabien wrote:
> Folks,
>

A comment.  Pickle is a method of creating persistent data, most 
commonly used to preserve data between runs.  A database is another 
method.  Either one can also be used with multiprocessing, but you 
seem to be worrying more about the mechanism than about the problem 
itself.
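
For reference, this is roughly what "preserving data between runs" with 
pickle looks like.  It is only a minimal sketch: the dictionary contents, 
the file name and the temporary directory are made up for illustration 
and are not taken from your program.

```python
import os
import pickle
import tempfile

# Hypothetical per-watershed result; the real contents are unknown to me.
result = {"watershed_id": "ws_0001", "area_km2": 12.5}

# Hypothetical location; your program would use the watershed directory.
path = os.path.join(tempfile.mkdtemp(), "a_result.pkl")

# One run: write the object to disk.
with open(path, "wb") as f:
    pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)

# A later run: read an equal copy back.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Note that restored is an equal copy of result, not the same object, and 
that unpickling a file from an untrusted source is unsafe.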

> I am writing a quite extensive piece of scientific software. Its
> workflow is quite easy to explain. The tool realizes series of
> operations on watersheds (such as mapping data on it, geostatistics and
> more). There are thousands of independent watersheds of different size,
> and the size determines the computing time spent on each of them.

First question:  what is the name or "identity" of a watershed? 
Apparently it's named by a directory.  But you mention ID as well.  You 
write a function A() that takes only a directory name. Is that the name 
of the watershed?  One per directory?  And you can derive the ID from 
the directory name?

Second question:  is there any communication between watersheds, or are 
they totally independent?

Third:  this "external data".  Is it dynamic?  Do you have to fetch it 
in a particular order?  Is it separated by watershed ID, or what?

Fourth:  when the program starts, are the directories all empty, so the 
presence of a pickle file tells you that A() has run?  Or is there some 
other meaning for those files?
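
If the answer is yes, the "pickle file means A() has run" convention 
could look something like the sketch below.  Everything here is an 
assumption on my part: the helper name run_a_if_needed, the file name 
a.pkl, and the idea that A() can be passed in as a callable taking the 
directory name.

```python
import os
import pickle


def run_a_if_needed(watershed_dir, compute_a, filename="a.pkl"):
    """Return (result, ran), running compute_a only if no pickle exists.

    compute_a stands in for the real A(); filename is a hypothetical name.
    """
    path = os.path.join(watershed_dir, filename)
    if os.path.exists(path):
        # The file is present, so treat A() as already done for this watershed.
        with open(path, "rb") as f:
            return pickle.load(f), False

    result = compute_a(watershed_dir)

    # Write to a temporary name first, then rename, so an interrupted run
    # does not leave a half-written pickle that looks like a finished one.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(result, f)
    os.replace(tmp_path, path)
    return result, True
```

The catch with this convention is that the file only tells you A() ran 
at some point, not whether its inputs have changed since.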

>
> Say I have the operations A, B, C and D. B and C are completely
> independent but they need A to be run first, D needs B and C, and so
> forth. Eventually the whole operations A, B, C and D will run once for
> all,

For all what?
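
As I read it, the ordering constraint itself is simple enough to write 
down directly.  A minimal sketch, assuming the only dependencies are the 
ones you list (B and C need A, D needs B and C); the run_order helper is 
hypothetical and does no cycle detection.

```python
# Each step maps to the set of steps that must finish before it.
deps = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def run_order(deps):
    """Return step names so that every step follows its prerequisites."""
    order = []
    done = set()

    def visit(step):
        if step in done:
            return
        # Handle prerequisites first; sorted() keeps the result repeatable.
        for prerequisite in sorted(deps[step]):
            visit(prerequisite)
        done.add(step)
        order.append(step)

    for step in sorted(deps):
        visit(step)
    return order


order = run_order(deps)
```

B and C come out in a fixed order here only because of the sorting; 
nothing in your description requires one before the other.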

> but of course the whole development is an iterative process and I
> rerun all operations many times.

Based on what?  Is the external data changing, and you have to rerun 
functions to update what you've already stored about them?  Or do you 
just mean you call the A() function on every possible watershed?

(I suddenly have to go out, so I can't comment on the rest, except that 
choosing to pickle, or to marshal, or to database, or to 
custom-serialize seems a bit premature.  You may have it all clear in 
your head, but I can't see what the interplay between all these calls to 
one-letter-named functions is intended to be.)

-- 
DaveA