[SciPy-user] A first proposal for dataset organization

Thu Sep 20 06:05:24 EDT 2007

Robert Kern wrote:
> David Huard wrote:
>> Hi Anne,
>>
>> 2007/9/19, Anne Archibald <peridot.faceted at gmail.com
>> <mailto:peridot.faceted at gmail.com>>:
>>
>>     On 18/09/2007, David Huard <david.huard at gmail.com
>>     <mailto:david.huard at gmail.com>> wrote:
>>
>>     > For large data sets, I'm not sure I understand what you're
>>     meaning. Do you
>>     > intend to include netcdf or HDF5 files and provide an interface to
>>     access
>>     > those data sets so users don't have to bother about the underlying
>>     engine ?
>>     > Do we really want to distribute a package weighing > 1GB ?
>>
>>     One of the points of this project, as I understand it, is to make it
>>     convenient for people to get and use real datasets. In particular, one
>>     possibility is to not include the data in this package, but instead
>>     only a script to download it from (say) the HEASARC. Thus big datasets
>>     are not outrageous, and more to the point, we need to be able to deal
>>     with them whatever form they are in natively.
>>
>>
>> My understanding was rather :
>> " ... to make it convenient for people to get and use real datasets for
>> use in SciPy and NumPy examples, documentation and tutorials. " This
>> limits the scope of the dataset package, at least for starters. If some
>> tutorial deals with larger than memory issues, then using a specialized
>> binary format makes sense. However, I think that pretty basic datasets
>> can illustrate the use of most SciPy and NumPy functions.
>
> That's an important use case, certainly, but I had in mind use cases like the
> one Anne gave, too, when I suggested parts of the design that David implemented.
> The scope is still fairly broad.
Yes, indeed: my sentence "to make it convenient for people to get and
use real datasets for use in SciPy and NumPy examples, documentation and
tutorials" was just a list of possible uses, not the only ones to take
into account. I also realized that my proposal sounded as if I were the
only one involved, which was not the case. I hope the people involved in
previous discussions on this matter didn't take any offence.

David (Huard) already highlighted one problem with my proposal (time
series representation). I would be really interested in comments about
using MaskedArrays to handle missing data (I've never used them myself),
and about the use of record arrays for the data. For example, I can see
cases where record arrays may be a problem (even if all your data are
homogeneous, you cannot directly treat a record array as one big numpy
array), but I don't know whether this is significant.
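
To make the two questions concrete, here is a minimal sketch of what I
have in mind. It is only an illustration, not part of the proposal: the
values, the -999.0 missing-value marker, and the field names "x" and "y"
are all made up, and the view() trick assumes every field has the same
dtype with no padding between fields.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical readings; -999.0 is an assumed marker for missing values.
raw = np.array([12.5, -999.0, 14.1, 13.8, -999.0])

# Mask the marker so that reductions skip the missing entries.
temps = ma.masked_values(raw, -999.0)
mean_temp = temps.mean()  # averages only the three unmasked values

# A record (structured) array whose fields all share one dtype.
rec = np.array([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)],
               dtype=[("x", np.float64), ("y", np.float64)])

# Fields are accessed by name; the array itself is 1-D, not (3, 2).
x_mean = rec["x"].mean()

# To use it as one big 2-D array, reinterpret the same memory as plain
# float64. This only works because all fields are float64 and unpadded.
plain = rec.view(np.float64).reshape(len(rec), -1)
col_means = plain.mean(axis=0)
```

The second half is the inconvenience I mean: the record array is seen as
a 1-D array of records, so whole-array operations need an explicit
conversion step like the one above, even though the data are homogeneous.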

cheers,

David