[SciPy-User] format for chunked file save and read ?

Wed Sep 22 14:09:05 EDT 2010

On Wed, Sep 22, 2010 at 1:35 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Wed, Sep 22, 2010 at 9:40 AM, Robert Kern <robert.kern at gmail.com> wrote:
>> On Wed, Sep 22, 2010 at 11:29, Nathaniel Smith <njs at pobox.com> wrote:
>>> On Wed, Sep 22, 2010 at 7:18 AM,  <josef.pktd at gmail.com> wrote:
>>>> What is the best file format for storing temporary data, for chunked
>>>> saving and loading, that only uses numpy and scipy?
>>>> I would like a file format that could be shared cross-platform and
>>>> across python/numpy versions if needed.
>>>
>>> Why not just use pickle? Mmap isn't giving you any advantages here
>>> that I can see, and pickles are much easier to handle when you want to
>>> write things out incrementally.
>>
>> Large arrays are not written or read incrementally in a pickle. We
>> have some tricks in order to not duplicate memory, but they don't
>> always work.
>
> Oh, I see, we're talking past each other. I don't think Josef's
> problem is how to save a large array where you need to avoid copies; I
> think the problem is to compute and save one array, then compute and
> save another array, then compute and save another array, etc. Pickles
> can handle that sort of incrementality just fine :-).
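
(For reference, a minimal sketch of the incremental pickling suggested above, assuming each chunk is a plain ndarray. The helper names `dump_chunks` and `load_chunks` and the file name are made up for illustration; only `pickle.dump` and `pickle.load` are real API.)

```python
import os
import pickle
import tempfile

import numpy as np


def dump_chunks(path, chunks):
    """Append each array to `path` as its own pickle record."""
    with open(path, "ab") as f:
        for chunk in chunks:
            pickle.dump(chunk, f, protocol=2)


def load_chunks(path):
    """Read pickle records back one at a time until the file is exhausted."""
    out = []
    with open(path, "rb") as f:
        while True:
            try:
                out.append(pickle.load(f))
            except EOFError:
                # pickle.load raises EOFError when no record is left
                break
    return out


# hypothetical temporary file, used only for this example
path = os.path.join(tempfile.mkdtemp(), "chunks.pkl")
dump_chunks(path, [np.arange(3), np.arange(3, 6)])
dump_chunks(path, [np.arange(6, 9)])  # a later run appends to the same file
restored = np.concatenate(load_chunks(path))
```

Each `pickle.dump` call writes a self-contained record, so the reader just calls `pickle.load` until the file runs out.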

No, Stata appends to the file (at least as it is described; I don't
know what they are doing internally).
If you save to a new file, then several files would need to be pieced
together, or the previous data needs to be loaded and saved again.
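
A rough sketch of what appending to a single file could look like with numpy only, assuming it is acceptable to store each chunk as its own .npy record back to back in one file. The helpers `append_chunk` and `read_chunks` and the file name are hypothetical; `np.save` and `np.load` both accept an open file object.

```python
import os
import tempfile

import numpy as np


def append_chunk(path, chunk):
    """Append one array to `path` as a complete .npy record."""
    with open(path, "ab") as f:
        np.save(f, chunk)


def read_chunks(path):
    """Yield the stored arrays in the order they were written."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # stop on file size instead of relying on an end-of-file exception
        while f.tell() < size:
            yield np.load(f)


# hypothetical temporary file, used only for this example
path = os.path.join(tempfile.mkdtemp(), "bootstrap_results.npy")
append_chunk(path, np.zeros((2, 3)))  # first batch of results
append_chunk(path, np.ones((4, 3)))   # later batch, same file
stacked = np.vstack(list(read_chunks(path)))
```

Note that a plain `np.load(path)` on such a file only returns the first record, so the file has to be read through a loop like `read_chunks`.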

For example, when they use an optimal stopping rule (estimating the
error for a given number of bootstrap samples), they have to do it in
at least two steps: draw an initial number of samples, then update the
error estimate, then sample more given the new estimate.
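
To make the staged sampling concrete, here is a small sketch. It is not Stata's actual rule: it assumes the statistic is the sample mean and that we stop once the Monte Carlo standard error of the bootstrap mean, sd / sqrt(B), falls below a tolerance. The function names and the defaults `n_initial` and `tol` are invented for illustration.

```python
import math

import numpy as np


def bootstrap_means(data, n_rep, rng):
    """Return `n_rep` bootstrap replicates of the sample mean."""
    idx = rng.randint(0, len(data), size=(n_rep, len(data)))
    return data[idx].mean(axis=1)


def two_stage_bootstrap(data, n_initial=200, tol=0.01, seed=0):
    """Draw an initial batch, then top up based on the estimated error."""
    rng = np.random.RandomState(seed)
    reps = bootstrap_means(data, n_initial, rng)
    # assumed stopping rule: solve sd / sqrt(B) <= tol for B
    n_needed = math.ceil((reps.std(ddof=1) / tol) ** 2)
    n_extra = max(0, n_needed - n_initial)
    if n_extra:
        reps = np.concatenate([reps, bootstrap_means(data, n_extra, rng)])
    return reps


data = np.random.RandomState(1).normal(size=50)
reps = two_stage_bootstrap(data, n_initial=100, tol=0.01)
```

The second batch is where appending to the existing results matters: the first replicates are kept and only the extra ones are added.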

>
> For bootstrapping, if we can construct the whole array in memory even
> once, then there's no need to save them out to a file at all -- the
> bootstrap routine can just return that array and let the user decide
> what they want to do with it!
>
> On Wed, Sep 22, 2010 at 10:04 AM,  <josef.pktd at gmail.com> wrote:
>> I don't like pickles much for anything that needs to be stored for
>> more than 5 minutes, because several times I wasn't able to read them
>> anymore after some version or code changes.
>
> Sure, if you pickle some object, and then change the in-memory
> definition of the object, the pickle system won't magically know how
> to translate the old version of the object into the new version -- I
> assume that's the issue you ran into? (I often use pickles for quick
> ad-hoc storage, but restrict myself to built-in types like tuples and
> dicts for just this reason.) But here you're just talking about
> ndarrays, where pickle compatibility *is* guaranteed (right?).

Yes, that would be possible. I got my aversion to pickle after pickling
a pandas DataFrame.

Are ndarrays guaranteed to be pickle compatible with numpy 2.0? Are
all Python types guaranteed to be pickle compatible between Python 2.5
and 3.x?

Just a preference given my tastes: I like open standards where I don't
have to worry about any compatibility questions. My favorite data file
format is a CSV file. (I lost HDF5 support with numpy 1.4.0.)

Josef

>
> -- Nathaniel