Pickle based workflow - looking for advice

Mon Apr 13 13:39:31 EDT 2015

On 13.04.2015 17:45, Devin Jeanpierre wrote:
> On Mon, Apr 13, 2015 at 10:58 AM, Fabien <fabien.maussion at gmail.com> wrote:
>> Now, to my questions:
>> 1. Does that seem reasonable?
> A big issue is the use of pickle, which is:
>
> * Often suboptimal performance-wise (e.g. you can't load only subsets
> of the data)
> * Makes forwards/backwards compatibility very difficult
> * Can make python 2/3 migrations harder
> * Creates data files which are difficult to analyze/fix by hand if
> they get broken
> * Is schemaless, and can accidentally include irrelevant data you
> didn't mean to store, making all of the above worse.
> * Means you have to be very careful who wrote the pickles, or you open
> a remote code execution vulnerability. It's common for people to
> forget that code is unsafe, and get themselves pwned. Security is
> always better if you don't do anything bad in the first place, than if
> you do something bad but try to manage the context in which the bad
> thing is done.
>
> Cap'n Proto might be a decent alternative that gives you good
> performance, by letting you process only the bits of the file you want
> to. It is also not a walking security nightmare.

Thanks for your thoughts. All these concerns are rather secondary for
the kind of tool I am working on, with the exception of speed. I will
have a look at Cap'n Proto.
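
A minimal sketch of two of the quoted points, in case it helps later
readers of the thread: pickle stores whatever happens to be on the
object, and loading a pickle can call arbitrary callables. The
Watershed class, its fields, and the Loud class are hypothetical names
made up for illustration; only the pickle and json calls are standard
library.

```python
import json
import pickle


class Watershed:
    """Hypothetical container, used only for illustration."""

    def __init__(self, name, area_km2):
        self.name = name
        self.area_km2 = area_km2
        self._cache = None  # scratch data we never meant to store


ws = Watershed("demo", 12.5)
ws._cache = list(range(1000))  # filled as a side effect of some computation

# pickle stores the whole instance __dict__, scratch data included.
restored = pickle.loads(pickle.dumps(ws))
pickled_fields = sorted(vars(restored))

# An explicit schema stores only what we list, in a format readable by hand.
record = {"name": ws.name, "area_km2": ws.area_km2}
text = json.dumps(record, sort_keys=True)


class Loud:
    """Unpickling calls whatever __reduce__ names; here a harmless len()."""

    def __reduce__(self):
        return (len, ("abc",))


# The loader runs len("abc") and hands back its result, not a Loud instance.
loaded = pickle.loads(pickle.dumps(Loud()))
```

Here pickled_fields includes _cache although nobody asked for it, while
the JSON record holds only the two listed fields. The call to len stands
in for any callable that an untrusted pickle could name.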

>
>> 2. Should Watershed be an object or should it be a simple dictionary? I
>> thought that an object could be nice, because it could take care of some
>> operations such as plotting and logging. Currently I defined a class
>> Watershed, but its attributes are defined and filled by A, B and C (this
>> seems a bit wrong to me).
> It is usually very confusing for attributes to be defined anywhere
> other than __init__. It's really very confusing for them to be defined
> by some random other function living somewhere else.

Yes, OK. I will stop that.
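
A rough sketch of what that could look like, assuming a Watershed that
declares all of its attributes up front and lets the processing steps
fill them through methods. The attribute names (outline, dem_path), the
set_outline method and the task_a function are hypothetical and only
stand in for the real operations A, B and C.

```python
class Watershed:
    """Hypothetical sketch: every attribute is declared in __init__."""

    def __init__(self, name, directory):
        self.name = name
        self.directory = directory
        # Filled later by the processing steps, but declared here so a
        # reader can see the full set of attributes in one place.
        self.outline = None
        self.dem_path = None

    def set_outline(self, outline):
        """Store the result of a step through a method, not a new attribute."""
        self.outline = outline

    def is_ready(self):
        """True once every declared attribute has been filled."""
        return self.outline is not None and self.dem_path is not None


def task_a(watershed):
    """Hypothetical stand-in for operation A."""
    watershed.set_outline([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
```

With this layout, a typo in an attribute name inside a task shows up as
an unfilled None instead of a silently created new attribute.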

>> I could give more responsibilities to this class
>> but it might become way too big: since the whole purpose of the tool is to
>> work on watersheds, making a Watershed class actually sounds like a code
>> smell (http://en.wikipedia.org/wiki/God_object)
> Whether they are methods or not doesn't make this any more or less of
> a god object -- if it stores all this data used by all these different
> things, it is already a bit off.

Yes, but I see no other way. The "god" container will probably be the 
watershed's directory with the data in it. The rest will specialize.

>> 3. The operation A opens an external file, reads data out of it and writes
>> it in Watershed object. Is it a bad idea to multiprocess this? (I guess it
>> is, since the file might be read twice at the same time)
> That does sound like a bad idea, for the reason you gave. It might be
> possible to read it once, and share it among many processes.

Yes. Thanks!
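
One way the "read it once, and share it" idea could be sketched, under
some assumptions: the external file is small enough to hold in memory,
and the platform supports the "fork" start method (POSIX only), so that
workers inherit the parent's data instead of re-opening the file. The
file format (one number per line) and the names read_once,
process_watershed, run and demo are hypothetical.

```python
import multiprocessing as mp
import os
import tempfile

# Filled once in the parent; forked workers inherit it without re-reading.
_SHARED_ROWS = None


def read_once(path):
    """Read the external file a single time, in the parent process."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


def process_watershed(index):
    """Hypothetical per-watershed step: it only looks at the shared data."""
    return index, _SHARED_ROWS[index] * 2.0


def run(path, n_workers=2):
    """Load the file in the parent, then fan the work out to a pool."""
    global _SHARED_ROWS
    _SHARED_ROWS = read_once(path)
    # "fork" is an assumption (POSIX only): children get a copy-on-write
    # view of _SHARED_ROWS, so the file is opened exactly once.
    ctx = mp.get_context("fork")
    with ctx.Pool(n_workers) as pool:
        return dict(pool.map(process_watershed, range(len(_SHARED_ROWS))))


def demo():
    """Write a tiny stand-in input file and process it."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("1.0\n2.5\n4.0\n")
    try:
        return run(tmp.name)
    finally:
        os.remove(tmp.name)


if __name__ == "__main__":
    print(demo())
```

Where "fork" is not available, the same effect could be had by reading
in the parent and passing each worker its own slice as an argument, at
the cost of pickling that slice once per task.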