[SciPy-Dev] New subpackage: scipy.data

josef.pktd at gmail.com
Fri Mar 30 10:29:48 EDT 2018


On Fri, Mar 30, 2018 at 9:54 AM, Eric Larson <larson.eric.d at gmail.com> wrote:
>>> It depends on the scale where this should go.
>
> In this particular case ("scipy.signal currently has no useful realistic
> signals"), if we add the proposed ~100 kB data file, I suspect that we can
> greatly enhance a large number of our scipy.signal examples. An ECG signal
> won't be perfect for all of them, but in many cases it will be a lot better
> and more instructive for users than what we can currently synthesize
> ourselves (while keeping synthesis sufficiently simple at least).
>
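
As a rough sketch of the kind of scipy.signal docstring example such a file
could enable (the loader below, its file name, and the 360 Hz sampling rate
are assumptions for illustration, not an existing API):

    import numpy as np
    from scipy.signal import find_peaks

    def electrocardiogram():
        # Hypothetical loader for the proposed ~100 kB ECG data file.
        fs = 360.0  # assumed sampling rate in Hz
        with np.load("ecg.npz") as data:
            return data["ecg"], fs

    ecg, fs = electrocardiogram()
    # Locate R-peaks: require a large amplitude and at least 0.4 s between beats.
    peaks, _ = find_peaks(ecg, height=np.percentile(ecg, 99),
                          distance=int(0.4 * fs))
    print("Estimated heart rate: %.1f bpm" % (60 * len(peaks) / (len(ecg) / fs)))
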
> Compared to a general dataset-fetching utility, the in-repo approach has
> clear disadvantages in terms of being incomplete and adding to repo size.
> Its advantages are in terms of simplifying doc building, access,
> maintenance, uniformity of functionality (benchmarks, Debian unit tests, doc
> building, etc.). On the balance, this makes it worth having IMO.
>
>> For example, a dataset package also runs into the problem of how much to
>> include.
>
>
> A proposed rule of thumb: SciPy can have (up to) a couple of small-sized
> files per module shipped with the repo in cases where such files greatly
> improve our ability to showcase/test/document functionality (benchmarks/unit
> tests/docstrings). This forces us to make subjective judgments about what
> will be sufficiently useful, sufficiently small, and sufficiently impactful
> for the module, but I think this will be a rare enough phenomenon that it's
> okay.
>
> In other words, I propose that scipy.datasets not provide an exhaustive or
> even extensive resource of data for users, but rather a minimal one for
> showcasing functionality. This seems consistent with what we already do with
> ascent/face, in that they improve the image-processing examples.
>
>> We've been doing this in scikit-image for a long time, and now regret
>> having any binary data in the repository
>
>
> I have had a similar problem while maintaining MNE-Python, which has some
> files in the repo and others in a GitHub repo (downloaded separately for
> testing). I have a similar feeling about the files that live in the repo
> today. However, for SciPy the problem seems a bit different in scope and
> scale -- a handful of small files can go a long way for SciPy, which isn't
> the case for MNE (and I would assume also many functions in scikit-image).
>
>> both scikit-learn and scikit-image provide access to larger datasets.
>
>
> There are other projects that also do this (MNE has huge ones hosted on
> osf.io; VisPy hosts data on GitHub). It would be awesome if someone unified
> all this stuff for cases where you want to deal with getting large datasets,
> or many different datasets.
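
For that larger, externally hosted case, a minimal sketch of what a shared
fetch-and-cache helper might look like (fetch_dataset, DATA_HOME, and the
checksum handling are illustrative assumptions, not any project's actual API):

    import hashlib
    import os
    from urllib.request import urlretrieve

    DATA_HOME = os.path.expanduser("~/.cache/scipy-data")

    def fetch_dataset(url, sha256):
        """Download url into a local cache if needed and return its path."""
        os.makedirs(DATA_HOME, exist_ok=True)
        path = os.path.join(DATA_HOME, os.path.basename(url))
        if not os.path.exists(path):
            urlretrieve(url, path)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != sha256:
            raise OSError("Checksum mismatch for %s" % path)
        return path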


Just to say:
I agree with all of this, and think it is a very good summary of the issues.

Josef

>
> My 2c,
> Eric
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev
>

