[SciPy-Dev] New subpackage: scipy.data

Thu Mar 29 20:52:34 EDT 2018

I also think that at most small datasets should be included in scipy directly.
But I think that for online storage scipy would be better off following
what some other packages already do.
Stefan mentions some attempts to get to a common format.
AFAIK, without being up to date, both scikit-learn and scikit-image provide
access to larger datasets.

For example, a dataset package also runs into the problem of how much to
include. I wouldn't install a dataset package with a few gigabytes of data
if I'm only interested in a tiny fraction for the examples that are
relevant to me.
(I'm not into analyzing images, movies or BIG DATA.)
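
A minimal sketch of that idea: fetch a single dataset on first use and cache
it locally, instead of installing everything up front. The function name
`fetch_dataset`, its parameters, and the cache layout are my own assumptions
for illustration, not an existing scipy API; it only uses the standard library.

```python
import shutil
import urllib.request
import warnings
from pathlib import Path


def fetch_dataset(name, base_url, cache_dir):
    """Return a local path to dataset `name`, downloading it on first use.

    Hypothetical helper: `name`, `base_url` and `cache_dir` are illustrative
    parameters, not part of any existing scipy interface.
    """
    cache_dir = Path(cache_dir)
    target = cache_dir / name
    if target.exists():
        # Already cached, so no network access is needed.
        return target

    warnings.warn(f"{name} not found locally, downloading from {base_url}")
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Write to a temporary name first so that an interrupted transfer
    # never leaves a truncated file under the final name.
    tmp = target.with_name(target.name + ".part")
    url = f"{base_url.rstrip('/')}/{name}"
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    tmp.replace(target)
    return target
```

Only the one requested file is transferred, and a second call for the same
name returns the cached path without touching the network.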

Josef

On Thu, Mar 29, 2018 at 8:10 PM, Ilhan Polat <ilhanpolat at gmail.com> wrote:
> Yes, that's true but GitHub seems like a robust place to live. Otherwise we
> can just point to any hardcoded URL. But if the size gets bigger in terms of
> wheels and cloning, I think keeping the data within SciPy is not a viable
> option. These all depend on what the future of datasets would be.
>
> On Fri, Mar 30, 2018 at 2:03 AM, <josef.pktd at gmail.com> wrote:
>>
>> On Thu, Mar 29, 2018 at 7:54 PM, Ilhan Polat <ilhanpolat at gmail.com> wrote:
>> > Would a separate repo scipy-datasets help? Then something like
>> >
>> > try:
>> >     importing
>> > except ImportError:
>> >     warn("I'm off to interwebz")
>> >     download from the repo
>> >
>> > might be feasible. The download part can either be that particular
>> > dataset or the whole scipy-datasets clone.
>> >
>>
>> IMO:
>>
>> It depends on the scale this is meant to reach.
>> I don't think it's worth it (maintaining and installing another
>> package or repo) for scipy, given that scipy is mostly a basic
>> numerical library and not driven by specific applications.
>>
>> For most areas there should already be some online repos or packages, and
>> it would be enough to have the accessing functions in scipy.datasets.
>> The only area that I can think of where there might not be some readily
>> available online source for datasets is signal.
>>
>> Josef
>>
>> > On Fri, Mar 30, 2018 at 1:16 AM, Stefan van der Walt
>> > <stefanv at berkeley.edu> wrote:
>> >>
>> >> On Thu, 29 Mar 2018 18:54:52 -0400, Warren Weckesser wrote:
>> >> > Can you summarize the problems that make you regret including the
>> >> > data?
>> >>
>> >> - The size of the repository (extra time on each clone, and that for
>> >>   data that isn't necessary in most use cases)
>> >>
>> >> - Artificial limit on data sizes: we now have a default place to store
>> >>   data, but we still need an additional mechanism for larger datasets.
>> >>   How do you choose the threshold for what goes in and what is too big?
>> >>
>> >> - Because these tiny embedded datasets are easily available, they
>> >>   become the default for demos.  If data is stored externally,
>> >>   realistic examples become more feasible and likely.
>> >>
>> >> Best regards
>> >> Stéfan
>> >> _______________________________________________
>> >> SciPy-Dev mailing list
>> >> SciPy-Dev at python.org
>> >> https://mail.python.org/mailman/listinfo/scipy-dev