[SciPy-user] A first proposal for dataset organization

Emanuele Olivetti emanuele at relativita.com
Wed Sep 19 08:50:31 EDT 2007


Hi David & David,

I like your proposal too, but I'm very interested in missing data,
so I'd like to see it covered in your proposal. And indeed, using
'NaN' as a placeholder for missing entries IS a bad idea. Unfortunately
I have no "best" solution to offer. In what I'm doing I use two
matrices: one to store the actual values and a second, boolean matrix to
say where the missing values are. Avoiding the use of values in entries
marked as missing is the responsibility of the analysis step.
This is good for my case but may not be that wonderful in general.
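
A minimal sketch of the two-matrix scheme described above, assuming NumPy.
The array contents and the helper name column_means are hypothetical and
only illustrate how an analysis step can skip the entries flagged as missing.

```python
import numpy as np

# Hypothetical 3x3 dataset; the entry at (1, 2) was never measured.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 0.0],   # 0.0 is an arbitrary filler, not data
                   [7.0, 8.0, 9.0]])

# Boolean matrix of the same shape: True marks a missing entry.
missing = np.zeros(values.shape, dtype=bool)
missing[1, 2] = True


def column_means(values, missing):
    """Mean of each column over the entries not marked as missing."""
    valid = ~missing
    counts = valid.sum(axis=0)                         # valid entries per column
    totals = np.where(valid, values, 0.0).sum(axis=0)  # ignore filler values
    return totals / counts


means = column_means(values, missing)

# numpy.ma pairs a data array with a boolean mask in the same way.
masked_means = np.ma.masked_array(values, mask=missing).mean(axis=0)
```

The sketch assumes every column has at least one valid entry; a column that
is entirely missing would give a division by zero in column_means.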

About handling large datasets, I had some experience using NiPy:
http://neuroimaging.scipy.org/
They have (had?) an implementation using memory-mapped arrays that
is good for many users; but my need was to access all the data
without the disk bottleneck, and even though I had enough RAM
I had some trouble avoiding the memory mapping and just doing
a full load. So the lesson I learnt is that users' needs are not
uniform, and a library should always take into account the basic
case (full load). P.S. NiPy did it.
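
A minimal sketch of a loader that offers both access modes, assuming NumPy
and a raw binary file. This is not NiPy's API: the function load_dataset, its
mmap flag, and the file data.dat are hypothetical names used for illustration.
The full load is the default and memory mapping is opt-in.

```python
import os
import tempfile

import numpy as np

# Hypothetical raw data file, written here only so the example can run.
path = os.path.join(tempfile.mkdtemp(), "data.dat")
np.arange(12, dtype=np.float64).reshape(3, 4).tofile(path)


def load_dataset(path, dtype, shape, mmap=False):
    """Return the data fully loaded in RAM, or memory-mapped if mmap=True."""
    if mmap:
        # Data stays on disk and is paged in on access.
        return np.memmap(path, dtype=dtype, mode="r", shape=shape)
    # Basic case: read the whole file into memory once.
    return np.fromfile(path, dtype=dtype).reshape(shape)


in_ram = load_dataset(path, np.float64, (3, 4))             # full load
mapped = load_dataset(path, np.float64, (3, 4), mmap=True)  # disk-backed
```

Both calls return arrays with the same contents; only where the data lives
differs, so the rest of the analysis code does not need to change.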

Hope this helps. More later.

Cheers,

Emanuele




