[SciPy-Dev] SciPy Goal

Thu Jan 5 08:51:02 EST 2012

On Thu, Jan 5, 2012 at 1:02 AM, Warren Weckesser
<warren.weckesser at enthought.com> wrote:
>
>
> On Wed, Jan 4, 2012 at 9:29 PM, Travis Oliphant <travis at continuum.io> wrote:
>>
>>
>> On Jan 4, 2012, at 8:22 PM, Fernando Perez wrote:
>>
>> > Hi all,
>> >
>> > On Wed, Jan 4, 2012 at 5:43 PM, Travis Oliphant <travis at continuum.io> wrote:
>> >> What do others think is missing?  Off the top of my head: basic wavelets
>> >> (dwt primarily) and more complete interpolation strategies (I'd like to
>> >> finish the basic interpolation approaches I started a while ago).
>> >> Originally, I used GAMS as an "overview" of the kinds of things needed in
>> >> SciPy.  Are there other relevant taxonomies these days?
>> >
>> > Well, probably not something that fits these ideas for scipy
>> > one-to-one, but the Berkeley 'thirteen dwarves' list from the 'View
>> > from Berkeley' paper on parallel computing is not a bad starting
>> > point; summarized here they are:
>> >
>> >    Dense Linear Algebra
>> >    Sparse Linear Algebra [1]
>> >    Spectral Methods
>> >    N-Body Methods
>> >    Structured Grids
>> >    Unstructured Grids
>> >    MapReduce
>> >    Combinational Logic
>> >    Graph Traversal
>> >    Dynamic Programming
>> >    Backtrack and Branch-and-Bound
>> >    Graphical Models
>> >    Finite State Machines
>>
>>
>> This is a nice list, thanks!
>>
>> >
>> > Descriptions of each can be found here:
>> > http://view.eecs.berkeley.edu/wiki/Dwarf_Mine and the full study is
>> > here:
>> >
>> > http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
>> >
>> > That list is biased towards the classes of codes used in
>> > supercomputing environments, and some of the topics are probably
>> > beyond the scope of scipy (say structured/unstructured grids, at least
>> > for now).
>> >
>> > But it can be a decent guiding outline to reason about what are the
>> > 'big areas' of scientific computing, so that scipy at least provides
>> > building blocks that would be useful in these directions.
>> >
>>
>> Thanks for the links.
>>
>>
>> > One area that hasn't been directly mentioned too much is the situation
>> > with statistical tools.  On the one hand, we have the phenomenal work
>> > of pandas, statsmodels and sklearn, which together are helping turn
>> > python into a great tool for statistical data analysis (understood in
>> > a broad sense).  But it would probably be valuable to have enough of a
>> > statistical base directly in numpy/scipy so that the 'out of the box'
>> > experience for statistical work is improved.  I know we have
>> > scipy.stats, but it seems like it needs some love.
>>
>> It seems like scipy stats has received quite a bit of attention.   There
>> is always more to do, of course, but I'm not sure what specifically you
>> think is missing or needs work.
>
>
>
> Test coverage, for example.  I recently fixed several wildly incorrect
> skewness and kurtosis formulas for some distributions, and I now have very
> little confidence that any of the other distributions are correct.  Of
> course, most of them probably *are* correct, but without tests, all are in
> doubt.

Actually, for this part it's not so much the test coverage. I have
written some imperfect tests, but they are disabled because skew,
kurtosis (3rd and 4th moments) and entropy certainly still have
several bugs.
One problem is that they are statistical tests with some false alarms,
especially for distributions that are far away from the normal.

But the main problem is that fixing those bugs requires a lot of work:
finding the correct formulas (which is not so easy for some of the more
exotic distributions) and then finding out where the current
calculations are wrong, as you have seen for the cases that you
recently fixed.
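
One deterministic way to cross-check a closed-form moment formula is to
integrate the pdf numerically and compare. The following is only a sketch
under stated assumptions, not code from scipy.stats: the helper names
(simpson, moments_from_pdf, gamma_pdf) are hypothetical, the gamma
distribution is just a convenient example with well-known moments, and it
uses only the standard library so that the reference values do not depend
on the code being checked.

```python
import math

def simpson(f, a, b, n=20000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def moments_from_pdf(pdf, lower, upper):
    """Return (mean, variance, skewness, excess kurtosis) by integrating pdf."""
    mean = simpson(lambda x: x * pdf(x), lower, upper)

    def central(k):
        # k-th central moment about the integrated mean
        return simpson(lambda x: (x - mean) ** k * pdf(x), lower, upper)

    var = central(2)
    return mean, var, central(3) / var ** 1.5, central(4) / var ** 2 - 3.0

def gamma_pdf(x, a=3.0):
    """Gamma density with shape a and unit scale (illustrative example)."""
    return x ** (a - 1) * math.exp(-x) / math.gamma(a)

# Textbook closed forms for the gamma distribution with shape a:
# mean a, variance a, skewness 2/sqrt(a), excess kurtosis 6/a.
a = 3.0
claimed = (a, a, 2.0 / math.sqrt(a), 6.0 / a)

# The upper limit 80 is an assumption: the tail beyond it is negligible here.
integrated = moments_from_pdf(gamma_pdf, 0.0, 80.0)

names = ("mean", "variance", "skewness", "kurtosis")
mismatches = [name for name, c, i in zip(names, claimed, integrated)
              if not math.isclose(c, i, rel_tol=1e-6)]
print(mismatches)  # an empty list means all four formulas agree
```

In practice the "claimed" values would come from the distribution under
test, for example dist.stats(..., moments='mvsk') in scipy.stats. Heavy
tails or singular densities would need a more careful integrator than
this one.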

Variances (2nd moments) might be ok, but I'm not completely convinced
anymore since I discovered that the corresponding test was a dummy.

Better tests would be useful, but statistical tests based on random
samples were the only ones I could come up with at the time that
(mostly) worked across all 100 distributions.
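
To illustrate the kind of sample-based check meant here, and why it gives
false alarms, here is a rough sketch. It is not the disabled test code:
the function names are hypothetical, and the standard errors sqrt(6/n)
and sqrt(24/n) are the large-sample values for normal data, which is
exactly the assumption that breaks down for distributions far away from
the normal.

```python
import math
import random

def sample_skew_kurtosis(xs):
    """Return (skewness, excess kurtosis) from biased sample moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def z_scores(xs, skew_claimed, kurt_claimed):
    """Compare sample skew/kurtosis with claimed values.

    Uses normal-theory standard errors, so the scores are only
    trustworthy when the data are close to normal.
    """
    n = len(xs)
    skew, kurt = sample_skew_kurtosis(xs)
    return ((skew - skew_claimed) / math.sqrt(6.0 / n),
            (kurt - kurt_claimed) / math.sqrt(24.0 / n))

rng = random.Random(12345)  # fixed seed so the check is repeatable
n = 20000

normal_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
expon_sample = [rng.expovariate(1.0) for _ in range(n)]

# Normal: skewness 0, excess kurtosis 0.
z_normal = z_scores(normal_sample, 0.0, 0.0)
# Exponential: skewness 2, excess kurtosis 6 (both formulas are correct).
z_expon = z_scores(expon_sample, 2.0, 6.0)

print("normal:     ", z_normal)
print("exponential:", z_expon)
```

For the normal sample the scores should look like draws from a standard
normal. For the exponential sample the sampling variance of the skewness
and kurtosis estimates is much larger than 6/n and 24/n, so the scores
can be large even though the claimed values are right. A threshold that
works for one distribution is therefore too strict or too loose for
another.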

Josef

>
> Warren
>
>
>>    A big question to me is the impact of data-frames as the underlying
>> data-representation of the algorithms and the relationship between the
>> data-frame and a NumPy array.
>>
>> -Travis
>>
>>
>> >
>> > Cheers,
>> >
>> > f
>> > _______________________________________________
>> > SciPy-Dev mailing list
>> > SciPy-Dev at scipy.org
>> > http://mail.scipy.org/mailman/listinfo/scipy-dev