[SciPy-user] A first proposal for dataset organization

Thu Sep 20 14:00:44 EDT 2007

David Huard wrote:
> OK. So here is my understanding of what has been said so far about the
> scope of the package, please correct me if I'm wrong.

For my part, I would modify most of these.

>  * Provide data sets for testing, demos and tutorials of scipy and numpy
> functions.

Agree.

>  * Propose a standard format to store data in text/binary files.

This wouldn't be on my radar at all. I think there is much less to be gained
from this than from having a reasonably consistent API at the Python level for
accessing the data, whatever format it happens to be in.

>  * Propose a format to represent the data internally (dictionary, record
> arrays, masked arrays, timeseries, etc).

Somewhat. I think it's useful to have a consistent API at the surface: load()
should probably always return a dictionary. However, I'm less concerned about
standardizing what's underneath. Each dataset has different needs. Trying to
force it into something inappropriate is a waste of effort.
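
A minimal sketch of that surface, assuming a made-up toy dataset: the key names
"data" and "meta" and the record-array layout are illustrative assumptions, not
anything agreed on in this thread. Only the outer dictionary is meant to be
uniform; the container underneath is whatever suits the dataset.

```python
import numpy as np


def load():
    """Hypothetical load() for a tiny, invented dataset."""
    # A record array happens to suit this dataset; another dataset might
    # prefer a masked array, a timeseries object, or plain lists.
    data = np.array(
        [(1.0, 2.0, "a"), (3.0, 4.0, "b")],
        dtype=[("x", "f8"), ("y", "f8"), ("tag", "U1")],
    )
    # The key names below are assumptions made for illustration only.
    meta = {"name": "toy", "source": "invented for this example"}
    return {"data": data, "meta": meta}


dataset = load()
print(sorted(dataset))       # the keys a user can rely on
print(dataset["data"]["x"])  # the underlying container is dataset-specific
```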

Instead of standards, I'd prefer (multiple) conventions that we simply
encourage. We encourage those conventions by providing utilities that operate
on data following them. For example, in his "Format of the data"
section, David Cournapeau suggested a convention for machine learning datasets
and some operations that would be useful to implement on top of that convention.
For other fields, other conventions might be used.
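
To illustrate how a utility can encourage a convention, here is a rough sketch.
It assumes a hypothetical convention in which a dataset dictionary carries a
2-D "data" array and a matching 1-D "label" array; this is not necessarily the
convention David Cournapeau proposed, and the function names are invented.

```python
import numpy as np


def follows_convention(dataset):
    """Check the assumed convention: 'data' and 'label' of equal length."""
    return (
        "data" in dataset
        and "label" in dataset
        and len(dataset["data"]) == len(dataset["label"])
    )


def split_by_label(dataset):
    """Group the rows of dataset['data'] by their label value."""
    if not follows_convention(dataset):
        raise ValueError("dataset does not follow the data/label convention")
    data = np.asarray(dataset["data"])
    label = np.asarray(dataset["label"])
    # tolist() turns NumPy scalars into plain Python keys.
    return {value: data[label == value] for value in np.unique(label).tolist()}


toy = {"data": [[1, 2], [3, 4], [5, 6]], "label": [0, 1, 0]}
groups = split_by_label(toy)
print(groups[0])  # rows labelled 0
print(groups[1])  # rows labelled 1
```

Datasets that ignore the convention still load fine; they just cannot use this
particular helper.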

>  * Implement an API to store/retrieve the data to/from text or binary
> files based on the standard.

Instead, I would provide some utilities for loading common formats. From the
perspective of the user of the dataset, the only real API would be load() and
the metadata. For the developer of the dataset, we would have a number of
utilities to help them implement the load() function for their dataset.
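
As a sketch of that split, assuming a hypothetical helper named
read_delimited() and an invented two-column dataset: the helper handles one
common format, and the dataset developer uses it inside load(). The inline
string stands in for a file shipped with the dataset.

```python
import csv
import io


def read_delimited(fileobj, delimiter=","):
    """Hypothetical helper: read a headered delimited file into columns."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


# Stand-in for a data file distributed with the dataset (invented values).
_RAW = "height,weight\n1.5,50\n1.8,80\n"


def load():
    """What a dataset developer might write on top of the helper."""
    columns = read_delimited(io.StringIO(_RAW))
    # Converting to float is this dataset's own decision, not the helper's.
    data = {name: [float(v) for v in values] for name, values in columns.items()}
    return {"data": data, "meta": {"name": "toy", "columns": list(data)}}


print(load()["data"]["height"])
```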

>  * Provide utilities to import data sets from web archives and convert
> them to the proposed format.

Rather, provide utilities for importing data sets from a URL and caching them in
a location established by convention. Parsing the files depends on the
format; instead of writing format conversion code, just write the loading code.
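
A minimal sketch of such a fetch-and-cache utility, using only the standard
library. The function name fetch(), the cache directory, and the hashed file
names are all assumptions; no cache location was settled in this thread.

```python
import hashlib
import os
import shutil
import urllib.request

# Assumed cache location, chosen only for illustration.
DEFAULT_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".dataset_cache")


def fetch(url, cache_dir=DEFAULT_CACHE_DIR):
    """Return a local path for url, downloading only if it is not cached."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so that any URL maps to a safe, distinct file name.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # Download to a temporary name first so that an interrupted
        # transfer never leaves a partial file under the final name.
        tmp = path + ".part"
        with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        os.replace(tmp, path)
    return path
```

A dataset's load() would then call fetch() with its source URL and parse the
returned file directly, with no intermediate format conversion.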

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco