[SciPy-user] A first proposal for dataset organization

Wed Sep 19 13:22:19 EDT 2007

David Huard wrote:
> Hi Anne,
> 
> 2007/9/19, Anne Archibald <peridot.faceted at gmail.com>:
> 
>     On 18/09/2007, David Huard <david.huard at gmail.com> wrote:
> 
>     > For large data sets, I'm not sure I understand what you mean. Do you
>     > intend to include netCDF or HDF5 files and provide an interface to
>     > access those data sets so users don't have to bother about the
>     > underlying engine? Do we really want to distribute a package weighing
>     > more than 1 GB?
> 
>     One of the points of this project, as I understand it, is to make it
>     convenient for people to get and use real datasets. In particular, one
>     possibility is to not include the data in this package, but instead
>     only a script to download it from (say) the HEASARC. Thus big datasets
>     are not outrageous, and more to the point, we need to be able to deal
>     with them whatever form they are in natively.
> 
> 
> My understanding was rather:
> "... to make it convenient for people to get and use real datasets for
> use in SciPy and NumPy examples, documentation and tutorials." This
> limits the scope of the dataset package, at least for starters. If some
> tutorial deals with larger-than-memory issues, then using a specialized
> binary format makes sense. However, I think that pretty basic datasets
> can illustrate the use of most SciPy and NumPy functions.

That's an important use case, certainly, but I had in mind use cases like the
one Anne gave, too, when I suggested parts of the design that David implemented.
The scope is still fairly broad.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco