[SciPy-Dev] New subpackage: scipy.data

Warren Weckesser warren.weckesser at gmail.com
Mon Apr 2 14:50:36 EDT 2018


On Fri, Mar 30, 2018 at 8:17 PM, Ralf Gommers <ralf.gommers at gmail.com>
wrote:

>
>
> On Fri, Mar 30, 2018 at 12:03 PM, Eric Larson <larson.eric.d at gmail.com>
> wrote:
>
>>> Top-level module for them alone sounds like overkill, and I'm not
>>> sure if discoverability alone is enough.
>>>
>>
>> Fine by me. And if we follow the idea that these should be added
>> sparingly, we can maintain discoverability without it growing out of
>> hand by populating the See Also sections of each function.
>>
>
> I agree with this, the 2 images and 1 ECG signal (to be added) that we
> have don't justify a top-level module. We don't want to grow beyond the
> absolute minimum of datasets. The package is already very large, which
> is problematic in certain cases. E.g. numpy + scipy still fits in the
> AWS Lambda limit of 50 MB, but there's not much margin.
>
>

Note: this is a reply to the thread as a whole, not specifically to
Ralf's comments (though those are quoted above).

After reading all the replies, the first question that comes to mind is:
should SciPy have *any* datasets?

I think this question has already been answered: we have had functions that
return images in scipy.misc for a long time, and I don't recall anyone ever
suggesting that these be removed.  (Well, there was lena(), but I don't
think anyone had a problem with adding a replacement image.)  And the pull
request for the ECG dataset has been merged (added to scipy.misc), so
there is currently support among the developers for providing datasets.

So the remaining questions are:
   (1) Where do the datasets reside?
   (2) What are the criteria for adding a new dataset?

Here's my 2¢:

(1) Where do the datasets reside?

My preference is to put all the datasets in a single top-level module,
scipy.datasets.  Robert preferred this module for discoverability, and I
agree.  By having all the datasets in one place, anyone can easily see what
is available.  Teachers and others developing educational material know
where to find source material for examples.  Developers, too, can easily
look for examples to use in our docstrings or tutorials. (By the way,
adding examples to the docstrings of all functions is an ongoing effort:
https://github.com/scipy/scipy/issues/7168.)
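
To make that concrete, here is a rough sketch of what a loader in the
proposed module could look like.  This is only an illustration: the
module does not exist yet, and the file name and storage format are
assumptions on my part.

    # Hypothetical sketch of a scipy.datasets-style loader; the module,
    # file name, and storage format are all assumptions.
    import os
    import numpy as np

    _DATA_DIR = os.path.join(os.path.dirname(__file__), "data")

    def electrocardiogram():
        """Return the bundled ECG signal as a 1-D array."""
        # Assume the data ships as a compressed NumPy archive stored
        # next to the module, keyed by the name "ecg".
        with np.load(os.path.join(_DATA_DIR, "ecg.npz")) as f:
            return f["ecg"]

Docstring examples could then start with something like

    >>> from scipy import datasets
    >>> ecg = datasets.electrocardiogram()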

Also, there are many well-known datasets that could be used in examples
for multiple scipy subpackages.  For a concrete example, a dataset that
I could see adding to scipy is the Hald cement dataset.  SciPy should
eventually have an implementation of principal component analysis (PCA),
which could conceivably live in scipy.linalg.  It would be reasonable to
use the Hald data in the docstrings of the new PCA function(s) (cf.
https://www.mathworks.com/help/stats/pca.html).  At the same time, the
Hald data could enrich the docstrings of some functions in scipy.stats.
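
As a hedged illustration of how such a docstring example might look,
here is PCA via scipy.linalg.svd.  A datasets.hald() loader is
hypothetical, so random data with the Hald shape (13 observations, 4
variables) stands in for the real values.

    # PCA via the SVD, sketched with scipy.linalg.  The hald() loader
    # is hypothetical; random data of the same shape stands in for it.
    import numpy as np
    from scipy import linalg

    rng = np.random.default_rng(0)
    X = rng.standard_normal((13, 4))   # stand-in for the Hald data

    # Center the columns and take the SVD: the rows of Vt are the
    # principal axes, and the squared singular values, scaled by
    # n - 1, are the variances explained by each component.
    Xc = X - X.mean(axis=0)
    U, s, Vt = linalg.svd(Xc, full_matrices=False)
    explained_variance = s**2 / (len(X) - 1)
    scores = Xc @ Vt.T                 # data in component coordinates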

Similarly, Fisher's iris dataset provides a well-known example that could
be used in docstrings in both scipy.cluster and scipy.stats.
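
A sketch along the same lines for scipy.cluster, with the same caveat
(no iris() loader exists, so three synthetic clusters shaped like the
iris data, 150 samples by 4 features, stand in):

    # k-means on iris-shaped data with scipy.cluster.  The iris()
    # loader is hypothetical; synthetic clusters stand in for it.
    import numpy as np
    from scipy.cluster.vq import kmeans2, whiten

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=m, scale=0.5, size=(50, 4))
                   for m in (0.0, 3.0, 6.0)])  # 150 x 4, 3 "species"

    # Scale each feature to unit variance, then look for 3 clusters.
    Xw = whiten(X)
    centroids, labels = kmeans2(Xw, 3)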


(2) What are the criteria for adding a new dataset?

So far, the only compelling reason I can see to even have datasets is to
have interesting examples in the docstrings (or at least in our
tutorials).  For example, the docstrings for scipy.ndimage.gaussian_filter
and several other ndimage transformations use the image returned by
scipy.misc.ascent():


https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
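
The gist of that example (paraphrased here, not copied verbatim from
the linked page):

    # Smooth the bundled test image with a Gaussian kernel.
    from scipy import misc, ndimage

    ascent = misc.ascent()             # 512x512 grayscale test image
    blurred = ndimage.gaussian_filter(ascent, sigma=5)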

I could see the benefit of having well-known datasets such as Fisher's iris
data, the Hald cement data, and some version of a sunspot activity time
series, to be used in the docstrings in scipy.stats, scipy.signal,
scipy.cluster, scipy.linalg, and elsewhere.

Stéfan expressed regret about including datasets in scikit-image.  The main
issue seems to be "bloat".  Scikit-image is an image processing library, so
the datasets used there are likely all images, and there is a minimum size
for a sample image to be useful as an example.  For scipy, we already have
two images, and I don't think we'll need more.  The newly added ECG dataset
is 116K (which is less than the existing image datasets: "ascent.dat" is
515K and "face.dat" is 1.5M).  The potential datasets that I mentioned
above (Hald, iris, sunspots) are all very small.  If we are conservative
about what we include, and focus on datasets chosen specifically to
demonstrate scipy functionality, we should be able to avoid dataset bloat.

This leads to my proposal for the criteria for adding a dataset:

(a) Not too big.  The size of a dataset should not exceed $MAX (but I don't
have a good suggestion for what $MAX should be at the moment).
(b) The dataset should be well-known, where "well-known" means that the
dataset is one that is already widely used as an example and many people
will know it by name (e.g. the iris dataset), or the dataset is a sample of
a common signal type or format (e.g. an ECG signal, or an image such as
misc.ascent).
(c) We actually *use* the dataset in one of *our* docstrings or tutorials.
I don't think our datasets package should become a repository of
interesting scientific data with no connection to the scipy code.  Its
purpose should be to enrich our documentation.  (Note that by this
criterion, the recently added ECG signal would not qualify!)

To summarize: I'm in favor of scipy.datasets, a conservatively curated
subpackage containing well-known datasets.


Warren


P.S. I should add that I'm not in favor of putting code in scipy that
fetches data from the web.  That type of data retrieval could be useful,
but it seems more appropriate for a package that is independent of scipy.




