[SciPy-User] format for chunked file save and read ?

Nathaniel Smith njs at pobox.com
Wed Sep 22 12:29:38 EDT 2010


On Wed, Sep 22, 2010 at 7:18 AM,  <josef.pktd at gmail.com> wrote:
> What is the best file format for storing temporary data, for chunked
> saving and loading, that only uses numpy and scipy?
> I would like a file format that could be shared cross-platform and
> across python/numpy versions if needed.

Why not just use pickle? Mmap isn't giving you any advantages here
that I can see, and pickles are much easier to handle when you want to
write things out incrementally.
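
To illustrate the incremental-pickle idea, here is a minimal sketch using only the standard library. The helper names (append_chunk, iter_chunks) and the file name are made up for this example, and the chunks are plain lists so the sketch has no third-party dependencies; NumPy arrays can be pickled the same way. Protocol 2 is chosen on the assumption that the file may need to be read by older Python versions.

```python
import os
import pickle
import tempfile


def append_chunk(path, chunk):
    """Append one pickled object to the end of the file (hypothetical helper)."""
    with open(path, "ab") as f:
        # Protocol 2 is an assumption: it is readable by old and new Pythons.
        pickle.dump(chunk, f, protocol=2)


def iter_chunks(path):
    """Yield the pickled objects back in the order they were written."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                # pickle.load raises EOFError once the file is exhausted.
                return


# Write three chunks one at a time, then read them all back.
path = os.path.join(tempfile.mkdtemp(), "bootstrap_samples.pkl")
chunks = [[0.1, 0.2, 0.3], [1.5, 2.5], [7.0]]
for chunk in chunks:
    append_chunk(path, chunk)
restored = list(iter_chunks(path))
```

Each call to pickle.dump writes a self-delimiting record, so the writer never needs to know in advance how many chunks there will be.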

> Use case: Stata (optionally) saves all bootstrap samples to a file
> so that the same samples will be available if a follow-up analysis is
> desired/required.
>
> We could also just save the seed and redo the same samples, which
> might not be fast for some models, however.

You should save the seed in any case!
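
As a rough sketch of why the seed is worth keeping: with a recorded seed, the resampling indices can be regenerated exactly. The function name bootstrap_indices and the particular seed value are hypothetical; the sketch assumes NumPy's np.random.RandomState is used as the generator.

```python
import numpy as np


def bootstrap_indices(n_obs, n_boot, seed):
    """Return an (n_boot, n_obs) array of resampling indices (hypothetical helper)."""
    # A dedicated RandomState keeps the draws independent of global state.
    rng = np.random.RandomState(seed)
    # Sampling with replacement: each row indexes one bootstrap sample.
    return rng.randint(0, n_obs, size=(n_boot, n_obs))


seed = 12345  # store this value alongside the results
first = bootstrap_indices(n_obs=10, n_boot=5, seed=seed)
again = bootstrap_indices(n_obs=10, n_boot=5, seed=seed)
```

Here first and again contain identical indices, so a follow-up analysis can rebuild the same samples from the seed alone.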

For most bootstrap purposes, it would probably work fine to just keep
the samples themselves in the bootstrap object, or to return them as an
extra return value. The 'boot' package for R does this. Most bootstrap
results don't involve huge amounts of memory.

On a more general note, I think APIs that take a filename and store
some of their (logical) return values there are somewhat "smelly"[1].
Managing temporary files programmatically is a huge pain, especially
when I'll just need to read the results back in again anyway. If
you're worried about memory use, maybe let the user pass in a callback
that will be called with each sample in turn, instead of hard-coding
this temporary file thing?
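
The callback idea could look roughly like the sketch below. The function bootstrap, its signature, and the on_sample parameter are all invented for illustration and are not an existing API; the sketch assumes NumPy is available.

```python
import numpy as np


def bootstrap(data, statistic, n_boot, seed, on_sample=None):
    """Return bootstrap replicates of statistic(data) (hypothetical API).

    If on_sample is given, it is called as on_sample(i, sample) for each
    resampled data set, so the caller decides what to do with it.
    """
    data = np.asarray(data)
    rng = np.random.RandomState(seed)
    results = []
    for i in range(n_boot):
        # Draw len(data) indices with replacement.
        idx = rng.randint(0, len(data), size=len(data))
        sample = data[idx]
        if on_sample is not None:
            on_sample(i, sample)
        results.append(statistic(sample))
    return np.array(results)


# One possible callback: keep every sample in memory.
kept = []
estimates = bootstrap(np.arange(20.0), np.mean, n_boot=50, seed=0,
                      on_sample=lambda i, sample: kept.append(sample))
```

A caller who is short on memory could instead pass a callback that writes each sample to disk, and a caller who does not need the samples passes nothing at all.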

[1] http://c2.com/xp/CodeSmell.html

Cheers,
-- Nathaniel


