[SciPy-Dev] SciPy Goal

Ralf Gommers ralf.gommers at googlemail.com
Thu Jan 5 01:47:13 EST 2012


On Thu, Jan 5, 2012 at 7:26 AM, Travis Oliphant <travis at continuum.io> wrote:

>
> On Jan 5, 2012, at 12:02 AM, Warren Weckesser wrote:
>
>
>
> On Wed, Jan 4, 2012 at 9:29 PM, Travis Oliphant <travis at continuum.io> wrote:
>
>>
>> On Jan 4, 2012, at 8:22 PM, Fernando Perez wrote:
>>
>> > Hi all,
>> >
>> > On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis at continuum.io>
>> wrote:
>> >> What do others think is missing?  Off the top of my head:   basic
>> wavelets
>> >> (dwt primarily) and more complete interpolation strategies (I'd like to
>> >> finish the basic interpolation approaches I started a while ago).
>> >> Originally, I used GAMS as an "overview" of the kinds of things needed
>> in
>> >> SciPy.   Are there other relevant taxonomies these days?
>> >
>> > Well, probably not something that fits these ideas for scipy
>> > one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
>> > from Berkeley' paper on parallel computing is not a bad starting
>> > point; summarized here they are:
>> >
>> >    Dense Linear Algebra
>> >    Sparse Linear Algebra [1]
>> >    Spectral Methods
>> >    N-Body Methods
>> >    Structured Grids
>> >    Unstructured Grids
>> >    MapReduce
>> >    Combinational Logic
>> >    Graph Traversal
>> >    Dynamic Programming
>> >    Backtrack and Branch-and-Bound
>> >    Graphical Models
>> >    Finite State Machines
>>
>>
>> This is a nice list, thanks!
>>
>> >
>> > Descriptions of each can be found here:
>> > http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
>> > here:
>> >
>> > http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
>> >
>> > That list is biased towards the classes of codes used in
>> > supercomputing environments, and some of the topics are probably
>> > beyond the scope of scipy (say structured/unstructured grids, at least
>> > for now).
>> >
>> > But it can be a decent guiding outline to reason about what are the
>> > 'big areas' of scientific computing, so that scipy at least provides
>> > building blocks that would be useful in these directions.
>> >
>>
>> Thanks for the links.
>>
>>
>> > One area that hasn't been directly mentioned too much is the situation
>> > with statistical tools.  On the one hand, we have the phenomenal work
>> > of pandas, statsmodels and sklearn, which together are helping turn
>> > python into a great tool for statistical data analysis (understood in
>> > a broad sense).  But it would probably be valuable to have enough of a
>> > statistical base directly in numpy/scipy so that the 'out of the box'
>> > experience for statistical work is improved.  I know we have
>> > scipy.stats, but it seems like it needs some love.
>>
>> It seems like scipy stats has received quite a bit of attention.   There
>> is always more to do, of course, but I'm not sure what specifically you
>> think is missing or needs work.
>
>
>
> Test coverage, for example.  I recently fixed several wildly incorrect
> skewness and kurtosis formulas for some distributions, and I now have very
> little confidence that any of the other distributions are correct.  Of
> course, most of them probably *are* correct, but without tests, all are in
> doubt.
>
>
> There is such a thing as *over-reliance* on tests as well.
>

True in principle, but we're so far from that point that over-reliance on
tests isn't something to worry about for the foreseeable future.
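
To make Warren's point concrete: these moment checks are cheap to write,
because many distributions have closed-form skewness and kurtosis. A
minimal sketch (the exponential distribution has skewness 2 and excess
kurtosis 6):

    import numpy as np
    from scipy import stats

    def test_expon_moments():
        # Closed-form values for the exponential distribution:
        # skewness = 2, excess (Fisher) kurtosis = 6.
        skew, kurt = stats.expon.stats(moments='sk')
        np.testing.assert_allclose(skew, 2.0)
        np.testing.assert_allclose(kurt, 6.0)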


> Tests help, but testing is not a black-or-white kind of thing, as seems
> to come across in many of the messages on this list about what part of
> scipy is in "good shape" or "easy to maintain" or "has love."  Just
> because tests exist doesn't mean that you can trust the code --- you also
> then have to trust the tests.  Ultimately, trust is built from successful
> *usage*.  Tests are only a pseudo-substitute for that usage.  It so
> happens that the usage which comes along with the code itself makes it
> easier to iterate on changes and catch some of the errors that can happen
> on refactoring.
>
> In summary, tests are good!  But, they also add overhead and themselves
> must be maintained, and I don't think it helps to disparage working code.
> I've seen a lot of terrible code that has *great* tests and seen projects
> fail because developers focus too much on the tests and not enough on what
> the code is actually doing.   Great tests can catch many things but they
> cannot make up for not paying attention when writing the code.
>

Certainly, but besides giving more confidence that code is correct, tests
are a massive help when working on existing code - especially for new
developers. As things stand, we have to be extremely careful when reviewing
patches to check that nothing gets broken (including backwards
compatibility). In that respect tests are not a maintenance burden but a
time saver.

As an example, last week I wanted to add a way to easily adjust the
bandwidth of gaussian_kde. That was maybe 10 lines of code and didn't take
long at all. Then I spent some time adding tests and improving the docs,
and thought I was done. After sending the PR, I spent at least an equal
amount of time reworking everything a couple of times so as not to break
any of the existing subclasses that could be found. On top of that it took
a lot of Josef's time to review it all and convince me of the error of my
ways. A few pre-existing tests could have saved us all a lot of time.
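
For the record, a test along the lines of the following sketch would have
caught the subclass breakage up front. It assumes the documented
subclassing pattern of overriding covariance_factor to control the
bandwidth:

    import numpy as np
    from scipy import stats

    class FixedBandwidthKDE(stats.gaussian_kde):
        # Documented way to customize the bandwidth: override
        # covariance_factor in a subclass.
        def covariance_factor(self):
            return 0.5

    def test_kde_subclassing_keeps_working():
        np.random.seed(1234)
        xs = np.random.randn(50)
        kde = FixedBandwidthKDE(xs)
        # Any refactoring of the bandwidth handling must keep
        # subclasses like this one evaluating correctly.
        ys = kde(np.linspace(-3, 3, 7))
        assert np.all(ys > 0)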

Ralf