[Numpy-discussion] Thoughts on persistence/object tracking in scientific code

Gael Varoquaux gael.varoquaux at normalesup.org
Mon Dec 29 17:40:07 EST 2008


Hi Luis,

On Mon, Dec 29, 2008 at 02:51:48PM -0500, Luis Pedro Coelho wrote:
> I coincidently started my own implementation of a system to manage
> intermediate results last week, which I called jug. I wasn't planning
> to make such an alpha version public just now, but it seems to be on
> topic.

Thanks for your input. This confirms my hunch that these problems
are universal.

It is interesting to see that you take a slightly different approach than
the ones already discussed. This probably stems from the fact that you
are mostly interested in parallelism, whereas there are other adjacent
problems that can be solved by similar abstractions. In particular, I
have the impression that you do not deal with what I call
"lazy re-evaluation". In other words, I am not sure whether you track
results closely enough to know when an intermediate result should be
re-run, or whether you run a 'clean' between runs to avoid the problem.

I must admit I moved away from using hashes to store objects on disk
because I am very much interested in traceability: I wanted my objects
to have meaningful names, and to be stored in convenient formats
(pickle, numpy .npy, hdf5, or domain-specific). I have since realized
that explicit naming is convenient, but that it should be optional.
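As an illustration of what I mean by optional explicit naming (a
hypothetical helper of my own, not part of any of the packages
discussed): store under a meaningful name when one is given, and fall
back on a content hash otherwise.

```python
import hashlib
import os
import pickle

def save(obj, name=None, directory='results'):
    """Persist an intermediate result under a meaningful name,
    falling back to a content hash when no name is given."""
    os.makedirs(directory, exist_ok=True)
    data = pickle.dumps(obj)
    if name is None:
        # Anonymous result: derive a stable name from the content.
        name = hashlib.sha1(data).hexdigest()
    path = os.path.join(directory, name + '.pkl')
    with open(path, 'wb') as f:
        f.write(data)
    return path

save([1, 2, 3], name='training_set')  # traceable, human-readable
save([1, 2, 3])                       # optional: hash-based fallback
```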

Your task-based approach, and the API you have built around it, remind
me a bit of Twisted's Deferred. Have you studied that API?

> A trick that helps is that I don't really use the argument values to hash 
> (which would be unwieldy for big arrays). I use the computation path (e.g., 
> this is the value obtained from f(g('something'),2)). Since, at least in my 
> problems, things tend to always map back into simple file-system paths, the 
> hash computation doesn't even need to load the intermediate results.
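If I read the trick correctly, the hash is derived from the expression
tree rather than from the data itself. A minimal sketch of the idea (my
own illustrative code, not jug's API): each task hashes its function
name together with the hashes of its inputs, recursing into sub-tasks,
so the identifier for f(g('something'), 2) is computed without ever
loading or running anything.

```python
import hashlib

class Task:
    """A node in the computation graph: a function plus its inputs."""
    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def path_hash(self):
        # Hash the computation path: the function name plus the hashes
        # (or reprs) of the inputs -- never the result values themselves.
        parts = [self.func.__name__]
        for arg in self.args:
            if isinstance(arg, Task):
                parts.append(arg.path_hash())  # recurse into sub-tasks
            else:
                parts.append(repr(arg))        # literal inputs by repr
        return hashlib.sha1('/'.join(parts).encode()).hexdigest()

def g(s):
    return s.upper()

def f(x, n):
    return x * n

# The f(g('something'), 2) example from the quote above:
g_task = Task(g, 'something')
f_task = Task(f, g_task, 2)
print(f_task.path_hash())  # stable identifier, no computation run
```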

I too noticed that hashing the argument values is bound to fail in all
but the simplest cases. This is the immediate limitation of the famous
memoize pattern when applied to scientific code. If I understand
correctly, you track the 'history' of an object and use it as the
object's hash, right? I had come to the conclusion that the history of
objects should be tracked, but I hadn't realized that using it as a
hash also solves the scoping problem. Thanks for the trick.

Would you consider making the code BSD-licensed? Because I want to be
able to reuse my code in non-open-source projects, and because I do not
want to lock out contributors or ask for copyright assignment, I keep
all my code BSD, like all the mainstream scientific Python projects.

I'll start writing up a wiki page with all the lessons and use cases
that have come out of this interesting feedback.

Cheers,

Gaël
