Pickle based workflow - looking for advice

Mon Apr 13 11:45:50 EDT 2015

On Mon, Apr 13, 2015 at 10:58 AM, Fabien <fabien.maussion at gmail.com> wrote:
> Now, to my questions:
> 1. Does that seem reasonable?

A big issue is the use of pickle, which is:

* Often suboptimal performance-wise (e.g. you can't load only subsets
of the data)
* Makes forwards/backwards compatibility very difficult
* Can make python 2/3 migrations harder
* Creates data files which are difficult to analyze/fix by hand if
they get broken
* Is schemaless, and can accidentally include irrelevant data you
didn't mean to store, making all of the above worse.
* Means you have to be very careful who wrote the pickles, or you open
a remote code execution vulnerability. It's common for people to
forget that loading pickles is unsafe, and get themselves pwned.
Security is always better if you don't do anything bad in the first
place than if you do something bad but try to manage the context in
which the bad thing is done.
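
To make the last point concrete, here is a minimal, harmless sketch of
why loading a pickle runs code. The Payload class is made up for
illustration; the only thing it relies on is the standard __reduce__
hook, which lets an object tell pickle "call this function with these
arguments when you load me".

```python
import pickle


class Payload:
    """Hypothetical class; __reduce__ tells pickle how to rebuild the object."""

    def __reduce__(self):
        # pickle stores "call len('abc')" instead of any Payload data.
        # A hostile file could name os.system here instead of len.
        return (len, ("abc",))


blob = pickle.dumps(Payload())

# Loading calls len("abc") and hands back its result, not a Payload.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

Nothing in the loading code asked for len to be called; the file
decided that. That is the whole problem.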

Cap'n Proto might be a decent alternative that gives you good
performance, by letting you process only the bits of the file you want
to. It is also not a walking security nightmare.

> 2. Should Watershed be an object or should it be a simple dictionary? I
> thought that an object could be nice, because it could take care of some
> operations such as plotting and logging. Currently I defined a class
> Watershed, but its attributes are defined and filled by A, B and C (this
> seems a bit wrong to me).

It is usually very confusing for attributes to be defined anywhere
other than __init__. It's really confusing for them to be defined
by some random other function living somewhere else.
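
A rough sketch of what I mean, assuming the steps still fill in the
values. The attribute names (outline, climate, runoff) and step_a are
invented for the example, not taken from your code. The point is only
that __init__ lists everything the object can hold, so a reader never
has to hunt through A, B and C to learn what a Watershed looks like.

```python
class Watershed:
    """Hypothetical sketch: every attribute is declared in __init__."""

    def __init__(self, name):
        self.name = name
        # Filled in later by the processing steps; None means "not computed yet".
        self.outline = None
        self.climate = None
        self.runoff = None

    def missing(self):
        """Return the names of attributes that no step has filled in yet."""
        return [key for key, value in vars(self).items() if value is None]


def step_a(watershed, outline):
    # A step may set an attribute, but only one that __init__ already declared.
    watershed.outline = outline


ws = Watershed("example")
step_a(ws, [(0, 0), (1, 0), (1, 1)])
print(ws.missing())
```

The missing() helper is optional, but it falls out for free once the
attributes all exist from the start.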

> I could give more responsibilities to this class
> but it might become way too big: since the whole purpose of the tool is to
> work on watersheds, making a Watershed class actually sounds like a code
> smell (http://en.wikipedia.org/wiki/God_object)

Whether those operations are methods or not doesn't make this any more
or less of a god object -- if it stores all this data used by all these
different things, it is already a bit off.

> 3. The operation A opens an external file, reads data out of it and writes
> it in Watershed object. Is it a bad idea to multiprocess this? (I guess it
> is, since the file might be read twice at the same time)

That does sound like a bad idea, for the reason you gave. It might be
possible to read it once, and share it among many processes.
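
One way that could look, as a sketch only: the parent process reads the
file a single time and hands the records to a multiprocessing.Pool.
read_records, process_record and run are placeholder names, and I am
assuming a one-number-per-line text file just to have something to
compute on.

```python
import sys
from multiprocessing import Pool


def read_records(path):
    """Read the whole file once, in the parent process."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def process_record(record):
    """Placeholder for the real per-record work; here it doubles a number."""
    return float(record) * 2


def run(path, workers=2):
    records = read_records(path)  # the only place the file is opened
    with Pool(workers) as pool:
        # Each worker receives records, never the file itself.
        return pool.map(process_record, records)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(run(sys.argv[1]))
```

The records are copied to the workers, so this only pays off if the
per-record work costs more than shipping the data around.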

-- Devin