[SciPy-Dev] New subpackage: scipy.data

Matt Haberland haberland at ucla.edu
Fri Apr 6 15:08:20 EDT 2018


 I wanted to add to Charles' comment: in addition to the test problems,
there are also optimization *benchmarks*. For instance, scipy/benchmarks/
benchmarks/linprog_benchmark_files contains ~90 benchmark problems (all of
the NETLIB LP benchmarks, as .npz files) totaling ~12MB, yet the current
linprog benchmark only uses two of them by default. It sounds like these
should be moved if space is such a concern.
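
For anyone who hasn't opened those files, here is a minimal sketch of how
one of the problems might be loaded and solved. The file name and the array
keys ('c', 'A_eq', 'b_eq') are assumptions about the .npz layout rather than
a documented format, so inspect npz.files first:

    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical file name; AFIRO is one of the NETLIB problems.
    npz = np.load('linprog_benchmark_files/AFIRO.npz')
    print(npz.files)   # check which arrays the archive actually contains

    # Assumed keys; NETLIB problems are equality-constrained with bounds.
    res = linprog(npz['c'], A_eq=npz['A_eq'], b_eq=npz['b_eq'],
                  bounds=(0, None), method='interior-point')
    print(res.status, res.fun)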

On Wed, Apr 4, 2018 at 7:54 AM, Charles R Harris <charlesr.harris at gmail.com>
wrote:

>
>
> On Mon, Apr 2, 2018 at 12:50 PM, Warren Weckesser <
> warren.weckesser at gmail.com> wrote:
>
>>
>>
>> On Fri, Mar 30, 2018 at 8:17 PM, Ralf Gommers <ralf.gommers at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Mar 30, 2018 at 12:03 PM, Eric Larson <larson.eric.d at gmail.com>
>>> wrote:
>>>
>>>> Top-level module for them alone sounds overkill, and I'm not sure if
>>>>> discoverability alone is enough.
>>>>>
>>>>
>>>> Fine by me. And if we follow the idea that these should be added
>>>> sparingly, we can maintain discoverability without it growing out of
>>>> hand by populating the See Also sections of each function.
>>>>
>>>
>>> I agree with this: the 2 images and 1 ECG signal (to be added) that we
>>> have don't justify a top-level module. We don't want to grow beyond the
>>> absolute minimum of datasets. The package is already very large, which
>>> is problematic in certain cases. E.g. numpy + scipy still fits within
>>> the AWS Lambda limit of 50 MB, but there's not much margin.
>>>
>>>
>>
>> Note: this is a reply to the thread, and not specifically to Ralf's
>> comments (but those are included).
>>
>> After reading all the replies, the first question that comes to mind is:
>> should SciPy have *any* datasets?
>>
>> I think this question has already been answered: we have had functions
>> that return images in scipy.misc for a long time, and I don't recall anyone
>> ever suggesting that these be removed.  (Well, there was lena(), but I
>> don't think anyone had a problem with adding a replacement image.)  And the
>> pull request for the ECG dataset has been merged (added to scipy.misc), so
>> there is current support among the developers for providing datasets.
>>
>> So the remaining questions are:
>>    (1) Where do the datasets reside?
>>    (2) What are the criteria for adding a new dataset?
>>
>> Here's my 2¢:
>>
>> (1) Where do the datasets reside?
>>
>> My preference is to keep all the datasets in the top-level module
>> scipy.datasets. Robert preferred this module for discoverability, and I
>> agree.  By having all the datasets in one place, anyone can easily see what
>> is available.  Teachers and others developing educational material know
>> where to find source material for examples.  Developers, too, can easily
>> look for examples to use in our docstrings or tutorials. (By the way,
>> adding examples to the docstrings of all functions is an ongoing effort:
>> https://github.com/scipy/scipy/issues/7168.)
>>
>> Also, there are many well-known datasets that could be used as examples
>> for multiple scipy packages.  For a concrete example, a dataset that I
>> could see adding to scipy is the Hald cement dataset.  SciPy should
>> eventually have an implementation of the PCA decomposition, and it could
>> conceivably live in scipy.linalg.  It would be reasonable to use the Hald
>> data in the docstrings of the new PCA function(s) (cf.
>> https://www.mathworks.com/help/stats/pca.html).  At the same time, the
>> Hald data could enrich the docstrings of some functions in scipy.stats.
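>>
>> To make that concrete, here is a minimal sketch of PCA built directly on
>> scipy.linalg.svd; the 13x4 random matrix below is only a stand-in for
>> the real Hald data, which scipy does not ship:
>>
>>     import numpy as np
>>     from scipy.linalg import svd
>>
>>     rng = np.random.RandomState(0)
>>     X = rng.normal(size=(13, 4))     # stand-in for the 13x4 Hald matrix
>>
>>     Xc = X - X.mean(axis=0)          # center each column
>>     U, s, Vt = svd(Xc, full_matrices=False)
>>
>>     components = Vt                  # principal axes, one per row
>>     scores = U * s                   # data in the principal-axis basis
>>     explained = s**2 / np.sum(s**2)  # variance fraction per component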
>>
>> Similarly, Fisher's iris dataset provides a well-known example that could
>> be used in docstrings in both scipy.cluster and scipy.stats.
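>>
>> A clustering docstring could use it along these lines; since scipy does
>> not ship iris today, the synthetic blobs below stand in for the iris
>> measurements:
>>
>>     import numpy as np
>>     from scipy.cluster.vq import kmeans2, whiten
>>
>>     rng = np.random.RandomState(42)
>>     # Two synthetic "species", 50 samples each, 4 measurements per sample.
>>     obs = np.vstack([rng.normal(0, 1, size=(50, 4)),
>>                      rng.normal(4, 1, size=(50, 4))])
>>
>>     obs_w = whiten(obs)              # scale each column to unit variance
>>     centroids, labels = kmeans2(obs_w, 2, minit='points')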
>>
>>
>> (2) What are the criteria for adding a new dataset?
>>
>> So far, the only compelling reason I can see to even have datasets is to
>> have interesting examples in the docstrings (or at least in our
>> tutorials).  For example, the docstring for scipy.ndimage.gaussian_filter
>> and several other transformations in ndimage use the image returned by
>> scipy.misc.ascent():
>>
>>     https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
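>>
>> Paraphrasing that docstring, the example is only a few lines (ascent()
>> and gaussian_filter are both existing scipy functions):
>>
>>     from scipy import misc, ndimage
>>
>>     ascent = misc.ascent()           # 512x512 grayscale test image
>>     blurred = ndimage.gaussian_filter(ascent, sigma=5)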
>>
>> I could see the benefit of having well-known datasets such as Fisher's
>> iris data, the Hald cement data, and some version of a sunspot activity
>> time series, to be used in the docstrings in scipy.stats, scipy.signal,
>> scipy.cluster, scipy.linalg, and elsewhere.
>>
>> Stéfan expressed regret about including datasets in scikit-image. The
>> main issue seems to be "bloat". Scikit-image is an image processing
>> library, so the datasets used there are likely all images, and there is
>> a minimum size below which a sample image is no longer useful as an
>> example. For scipy, we already have two images, and I don't think we'll
>> need more. The newly added ECG dataset is 116K (smaller than the
>> existing image datasets: "ascent.dat" is 515K and "face.dat" is 1.5M).
>> The potential datasets that I mentioned above (Hald, iris, sunspots) are
>> all very small. If we are
>> conservative about what we include, and focus on datasets chosen
>> specifically to demonstrate scipy functionality, we should be able to avoid
>> dataset bloat.
>>
>> This leads to my proposal for the criteria for adding a dataset:
>>
>> (a) Not too big.  The size of a dataset should not exceed $MAX (but I
>> don't have a good suggestion for what $MAX should be at the moment).
>> (b) The dataset should be well-known, where "well-known" means that the
>> dataset is one that is already widely used as an example and many people
>> will know it by name (e.g. the iris dataset), or the dataset is a sample of
>> a common signal type or format (e.g. an ECG signal, or an image such as
>> misc.ascent).
>> (c) We actually *use* the dataset in one of *our* docstrings or
>> tutorials.  I don't think our datasets package should become a repository
>> of interesting scientific data with no connection to the scipy code.  Its
>> purpose should be to enrich our documentation.  (Note that by this
>> criterion, the recently added ECG signal would not qualify!)
>>
>> To summarize: I'm in favor of scipy.datasets, a conservatively curated
>> subpackage containing well-known datasets.
>>
>>
> There are also some standard functions used for testing optimization. I
> wonder if it would be reasonable to make those public?
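>
> One of those, the Rosenbrock function, is in fact already public as
> scipy.optimize.rosen, which gives a flavor of what exposing the rest
> might look like:
>
>     from scipy.optimize import minimize, rosen, rosen_der
>
>     # Minimize the classic banana-shaped test function from a nearby start.
>     res = minimize(rosen, x0=[1.3, 0.7, 0.8], jac=rosen_der, method='BFGS')
>     print(res.x)     # close to the global minimum at [1, 1, 1]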
>
> Chuck
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev
>
>


-- 
Matt Haberland
Assistant Adjunct Professor in the Program in Computing
Department of Mathematics
6617A Math Sciences Building, UCLA