[SciPy-dev] Starting a datasets package, again

Brian Hawthorne brian.lee.hawthorne at gmail.com
Fri Jun 8 18:36:01 EDT 2007


On 6/6/07, Robert Kern <robert.kern at gmail.com> wrote:
>
> David Cournapeau wrote:
> > There are already so many emails on the scipy ML (and personally, maybe
> > 2/3 of the emails related to my packages) because of installation
> > problems, I really worry about this point. I think this hurts the whole
> > numpy/scipy community quite a lot (lack of one click button "make it
> > work"), and I am afraid this may be a step away from this goal.
>
> There's no substitute for giving your users a binary with everything it
> needs in one tarball, data included. However, that doesn't scale at all.
> Everything else is a compromise between these two concerns. If bundling
> the example data into your examples works for your needs, by all means,
> do it, and ignore all notions of scipydata packages. There's nothing
> wrong with copy-and-paste, here.
>
> It's still useful to build a repository of scipydata packages with
> metadata and parsing code already done. If you are only concerned with
> distributing examples with your packages, you may not use the scipydata
> packages in them directly, but you can still use the repository as a
> resource when developing your examples.


We have run into this same issue of large example/testing datasets in the
nipy (neuroimaging.scipy.org) project.  Instead of packaging our data as a
separate installable dependency, we keep the data online and have developed
a bit of boilerplate to access it transparently at runtime, including
downloading, caching, and (where needed) unzipping:

http://projects.scipy.org/neuroimaging/ni/browser/ni/trunk/neuroimaging/data_io/datasource.py

Used in scipy, for example, this might look something like:

>>> from scipy.data import Repository
>>> repo = Repository("http://data.scipy.org/")  # could be set as a default
>>> datablob = repo.open("pyem/example1.mat.bz2").read()

The first time you run this, it downloads, unzips, and drops the result
under a cache directory; subsequent opens use the local file.  This way
only the necessary data gets downloaded.
The only non-builtin dependency is the path module, which is standalone
(used in place of os.path) and found under neuroimaging.utils.path.  Feel
free to copy and modify it if this is a direction you want to go.  And if
you do use it, then we could import it from scipy instead of maintaining
our own copy :)
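To make the idea concrete, here is a minimal sketch of such a caching
repository using only the standard library.  The class name and open()
signature just follow the hypothetical example above; the real nipy
datasource.py differs in its details, so treat this as an illustration,
not the actual implementation:

```python
import bz2
import os
import urllib.parse
import urllib.request


class Repository:
    """Sketch of a download-and-cache data repository.

    Hypothetical API for illustration; not the nipy datasource.py code.
    """

    def __init__(self, baseurl, cachedir=None):
        self._baseurl = baseurl.rstrip("/")
        # Default cache location is an assumption for this sketch.
        self._cachedir = cachedir or os.path.join(
            os.path.expanduser("~"), ".scipy_data_cache")

    def _cachepath(self, relpath):
        # Mirror the repository layout under the cache directory.
        return os.path.join(self._cachedir, *relpath.split("/"))

    def open(self, relpath, mode="rb"):
        """Return a file object for relpath, downloading on first use.

        A trailing .bz2 is decompressed transparently, so the cached
        copy holds the uncompressed data and later opens are local.
        """
        target = self._cachepath(relpath)
        if target.endswith(".bz2"):
            target = target[:-len(".bz2")]
        if not os.path.exists(target):
            os.makedirs(os.path.dirname(target), exist_ok=True)
            url = self._baseurl + "/" + urllib.parse.quote(relpath)
            with urllib.request.urlopen(url) as remote:
                blob = remote.read()
            if relpath.endswith(".bz2"):
                blob = bz2.decompress(blob)
            with open(target, "wb") as cached:
                cached.write(blob)
        return open(target, mode)
```

Only the first open for a given path touches the network; everything after
that is an ordinary local file open, which matches the behavior described
above.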
