[SciPy-Dev] New subpackage: scipy.data

Eric Larson larson.eric.d at gmail.com
Fri Mar 30 09:54:29 EDT 2018


> It depends on the scale where this should go.

In this particular case ("scipy.signal currently has no useful realistic
signals"), if we add the proposed ~100 kB data file, I suspect that we can
greatly enhance a large number of our scipy.signal examples. An ECG signal
won't be perfect for all of them, but in many cases it will be a lot better
and more instructive for users than what we can currently synthesize
ourselves (while keeping synthesis sufficiently simple at least).

Compared to a general dataset-fetching utility, the in-repo approach has
clear disadvantages: it is incomplete and it adds to repo size. Its
advantages are simpler doc building, access, and maintenance, plus
uniformity of functionality (benchmarks, Debian unit tests, doc building,
etc.). On balance, this makes it worth having IMO.

> For example, a dataset package also runs into the problem how much to
> include.


A proposed rule of thumb: SciPy can have (up to) a couple of small-sized
files per module shipped with the repo in cases where such files greatly
improve our ability to showcase/test/document functionality
(benchmarks/unit tests/docstrings). This forces us to make subjective
judgments about what will be sufficiently useful, sufficiently small, and
sufficiently impactful for the module, but I think this will be a rare
enough phenomenon that it's okay.

In other words, I propose that scipy.datasets not provide an *exhaustive* or
even *extensive* resource of data for users, but rather a *minimal* one for
showcasing functionality. This seems consistent with what we already do
with ascent/face, in that they improve the image-processing examples.

> We've been doing this in scikit-image for a long time, and now regret
> having any binary data in the repository


I have had a similar problem while maintaining MNE-Python, which has some
files in the repo and others in a GitHub repo (downloaded separately for
testing). I have a similar feeling about the files that live in the repo
today. However, for SciPy the problem seems a bit different in scope and
scale -- a handful of small files can go a long way for SciPy, which isn't
the case for MNE (and I would assume also many functions in scikit-image).

> both scikit-learn and scikit-image use access to larger datasets.


There are other projects that also do this (MNE has huge ones hosted on
osf.io, VisPy hosts data on GitHub). It would be awesome if someone unified
all this stuff for cases where you want to deal with getting large
datasets, or many different datasets.

My 2c,
Eric