Pickle based workflow - looking for advice

Mon Apr 13 13:30:24 EDT 2015

On 13.04.2015 18:25, Dave Angel wrote:
> On 04/13/2015 10:58 AM, Fabien wrote:
>> Folks,
>>
>
> A comment.  Pickle is a method of creating persistent data, most
> commonly used to preserve data between runs.  A database is another
> method.  Although either one can also be used with multiprocessing, you
> seem to be worrying more about the mechanism, and not enough about the
> problem.
>
>> I am writing a quite extensive piece of scientific software. Its
>> workflow is quite easy to explain. The tool realizes series of
>> operations on watersheds (such as mapping data on it, geostatistics and
>> more). There are thousands of independent watersheds of different size,
>> and the size determines the computing time spent on each of them.
>
> First question:  what is the name or "identity" of a watershed?
> Apparently it's named by a directory.  But you mention ID as well.  You
> write a function A() that takes only a directory name. Is that the name
> of the watershed?  One per directory?  And you can derive the ID from
> the directory name?
>
> Second question, is there any communication between watersheds, or are
> they totally independent?
>
> Third:  this "external data", is it dynamic, do you have to fetch it in
> a particular order, is it separated by watershed id, or what?
>
> Fourth:  when the program starts, are the directories all empty, so the
> presence of a pickle file tells you that A() has run?  Or is there some
> other meaning for those files?
>
>>
>> Say I have the operations A, B, C and D. B and C are completely
>> independent but they need A to be run first, D needs B and C, and so
>> forth. Eventually the whole operations A, B, C and D will run once for
>> all,
>
> For all what?
>
>> but of course the whole development is an iterative process and I
>> rerun all operations many times.
>
> Based on what?  Is the external data changing, and you have to rerun
> functions to update what you've already stored about them?  Or do you
> just mean you call the A() function on every possible watershed?
>
>
>
> (I suddenly have to go out, so I can't comment on the rest, except that
> choosing to pickle, or to marshall, or to database, or to
> custom-serialize seems a bit premature.  You may have it all clear in
> your head, but I can't see what the interplay between all these calls to
> one-letter-named functions is intended to be.)

Thanks Dave for your interest. I'll make an example:

external files:
- watershed outlines (single file)
- global topography (single file)
- climate data (single file)

Each watershed has an ID. Each watershed is completely independent.

So function A, for example, will take one ID as argument, open the
watershed file and extract its outline, make a local map, open the
topography file, extract a part of it, make a watershed object and store
the watershed's local data in it.
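
A minimal sketch of what A does, under simplifying assumptions: the
name task_a, the bounding-box outline (row0, row1, col0, col1) and the
nested lists standing in for numpy arrays are all hypothetical, and
SimpleNamespace stands in for my watershed class.

```python
import pickle
from pathlib import Path
from types import SimpleNamespace


def task_a(watershed_id, outlines, topography, base_dir):
    """Build the watershed object for one ID and pickle it in its directory."""
    # SimpleNamespace is a stand-in for the real watershed class.
    ws = SimpleNamespace(id=watershed_id)
    ws.outline = outlines[watershed_id]

    # Crop the global grid to the (hypothetical) bounding box of the outline.
    row0, row1, col0, col1 = ws.outline
    ws.topography = [row[col0:col1] for row in topography[row0:row1]]

    # One directory per watershed, named after the ID.
    out_dir = Path(base_dir) / watershed_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "watershed.p"
    with open(out_path, "wb") as f:
        pickle.dump(ws, f)
    return out_path
```

In the real code the outlines and the topography are read from the
external files rather than passed in as arguments.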

Function B will open the watershed pickle, take the local information it 
needs (like local topography, already cropped to the region of interest) 
and map climate data on it.
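
Roughly, B could be sketched like this. The name task_b and the
lapse-rate formula are only illustrative assumptions (the real mapping
reads the climate file); the point is the load / add attribute / dump
cycle. Writing to a temporary file first and renaming it is an extra
precaution I am assuming here, so that an interrupted run does not
leave a truncated pickle behind.

```python
import os
import pickle
from pathlib import Path

LAPSE_RATE = 0.0065  # K per metre; illustrative value only


def task_b(watershed_dir, sea_level_temp):
    """Load the pickled watershed, add a temperature field, store it again."""
    path = Path(watershed_dir) / "watershed.p"
    with open(path, "rb") as f:
        ws = pickle.load(f)

    # Stand-in for "mapping climate data": one value per topography cell.
    ws.temperature = [[sea_level_temp - LAPSE_RATE * z for z in row]
                      for row in ws.topography]

    # Write next to the original, then swap the files in a single step.
    tmp_path = path.with_name(path.name + ".tmp")
    with open(tmp_path, "wb") as f:
        pickle.dump(ws, f)
    os.replace(tmp_path, path)
    return ws
```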

And so forth, so that each function A, B, C, ... builds upon the
information of the others and adds its own "service" in terms of data.
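
The ordering constraint from my first mail (B and C need A, D needs B
and C) can be written down explicitly. This is only a sketch using the
standard library's graphlib (Python 3.9+); the DEPENDENCIES dict and
the execution_order helper are hypothetical names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must have run before it.
DEPENDENCIES = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def execution_order(dependencies):
    """Return one valid order in which the tasks can be run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)  # A comes first and D last; B and C may appear in either order
```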

Currently, all data (mostly numpy arrays and vector objects) are stored
as object attributes, which I guess is bad practice. It's kind of a
"database for dummies": reading the topography of watershed ID 0128 means:
- open watershed.p in the '0128' directory
- read the watershed.topography attribute
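
In code, these two steps amount to something like the following
(read_attribute is a hypothetical helper name):

```python
import pickle
from pathlib import Path


def read_attribute(base_dir, watershed_id, name):
    """Unpickle <base_dir>/<watershed_id>/watershed.p and return one attribute."""
    with open(Path(base_dir) / watershed_id / "watershed.p", "rb") as f:
        ws = pickle.load(f)
    return getattr(ws, name)
```

Note that the whole object is unpickled even when only one attribute
is needed, so every read pays for all the data stored so far.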

I think that I like Peter's idea to follow a file based workflow 
instead, and forget about my watershed object for now.
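
One possible reading of such a file-based workflow, sketched under my
own assumptions: each step writes exactly one output file into the
watershed directory, and a step is skipped when its file already
exists. The run_step helper and the file naming are hypothetical.

```python
import pickle
from pathlib import Path


def run_step(watershed_dir, name, compute, inputs=()):
    """Run `compute` unless <name>.p already exists; return the output path."""
    wdir = Path(watershed_dir)
    out_path = wdir / f"{name}.p"
    if out_path.exists():
        return out_path  # already computed in an earlier run

    # Load the outputs of the steps this one depends on, in the given order.
    args = []
    for dep in inputs:
        with open(wdir / f"{dep}.p", "rb") as f:
            args.append(pickle.load(f))

    wdir.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as f:
        pickle.dump(compute(*args), f)
    return out_path
```

A step would then be called as, e.g.,
run_step("0128", "climate", map_climate, inputs=("topography",)),
where map_climate is whatever function does the actual work. Deleting
a file forces that step to be recomputed on the next run.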

But I'd still be interested in your comments if you find time for it.

Fabien