[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

Thu Feb 26 18:47:08 EST 2009

I think a discussion for a roadmap for stats will be very useful.

Currently my priority is still your point
     7 iii) Ideally there should be tests that check the function accuracy.

I consider this the main point of almost all my work on stats, and there are
still some incorrect parts left.
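
As a rough illustration of the kind of accuracy test meant here, the
sketch below checks one function against a value that can be worked out
by hand. The choice of scipy.stats.pearsonr, the data, and the test name
are assumptions made for the example, not part of any existing test
suite.

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy import stats


def test_pearsonr_against_hand_computed_value():
    # Small data set whose correlation can be worked out by hand:
    # deviations from the means are dx = [-2, -1, 0, 1, 2] and
    # dy = [-2, 0, -1, 2, 1], so sum(dx*dy) = 8 and
    # sum(dx**2) = sum(dy**2) = 10, which gives r = 8 / 10 = 0.8.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
    r, _ = stats.pearsonr(x, y)
    assert_allclose(r, 0.8, rtol=1e-10)


test_pearsonr_against_hand_computed_value()
```

The same pattern (known input, independently derived reference value,
explicit tolerance) should carry over to other functions, with reference
values taken from a textbook or another package where a hand calculation
is impractical.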

The next step I had in mind for the current code base was to evaluate
the functions: whether they are ok, whether they can be generalized
(e.g. to more dimensions), or whether they are trivial and should be
removed.
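
To make "generalized, e.g. dimension" concrete, here is a minimal sketch
of a function that works along any axis instead of only on 1-D input.
The name gmean_nd is hypothetical and this is not the scipy.stats
implementation; it only assumes strictly positive input.

```python
import numpy as np


def gmean_nd(a, axis=0):
    """Geometric mean along `axis` (hypothetical name, illustration only)."""
    a = np.asarray(a, dtype=float)
    # Working in log space turns the product into a sum, so the
    # reduction can be delegated to ndarray.mean with any axis.
    return np.exp(np.log(a).mean(axis=axis))


data = np.array([[1.0, 4.0],
                 [4.0, 1.0]])
col_means = gmean_nd(data, axis=0)     # one value per column
row_means = gmean_nd(data, axis=1)     # one value per row
overall = gmean_nd(data, axis=None)    # all elements, flattened
```

For this input every geometric mean is 2, since sqrt(1 * 4) = 2 along
either axis and 16 ** 0.25 = 2 overall.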

Next are changes in the interface and combining or comparing mstats and stats.
Here, I don't have a clear opinion yet of how far we can or want to
consistently generalize all statistical functions to the different
types of arrays. In many cases I looked at, the masked array version
looked sufficiently different that I would be reluctant to merge them.
One radical alternative would be to deprecate stats.stats and expand
mstats, since it is already better designed to handle different array
types. But I like the "simple" versions in stats, and I'm curious
about any speed difference.
But general tools to interface to different array types would be
useful and should be carefully designed, e.g. functions like ols that
have a plain ndarray core, but can access the data from structured
arrays and masked arrays.
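
One possible shape for such a function is sketched below: a core that
only sees plain ndarrays, plus a thin front end that unpacks structured
and masked input. The names _ols_core and ols, the fields argument, and
the policy of dropping any row with a masked value are all assumptions
for illustration, not an existing scipy interface.

```python
import numpy as np


def _ols_core(y, X):
    """Plain-ndarray core: least-squares coefficients for y ~ X @ beta."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def ols(y, X, fields=None):
    """Hypothetical front end that accepts plain, structured or masked arrays."""
    # Structured array (assumed unmasked): use the named fields as columns.
    if isinstance(X, np.ndarray) and X.dtype.names is not None:
        names = fields if fields is not None else X.dtype.names
        X = np.column_stack([X[name] for name in names])
    # Treat everything as masked; plain arrays simply get an all-False mask.
    y = np.ma.asarray(y, dtype=float)
    X = np.ma.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Assumed policy: drop every observation with a masked value.
    keep = ~(np.ma.getmaskarray(y) | np.ma.getmaskarray(X).any(axis=1))
    return _ols_core(np.asarray(y.data)[keep], np.asarray(X.data)[keep])


x = np.arange(6.0)
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])
beta_plain = ols(y, X)

# Masked input: the corrupted observation is hidden by the mask.
y_bad = y.copy()
y_bad[3] = 100.0
beta_masked = ols(np.ma.array(y_bad, mask=[0, 0, 0, 1, 0, 0]), X)

# Structured input: columns are addressed by field name.
rec = np.zeros(6, dtype=[("const", float), ("x", float)])
rec["const"] = 1.0
rec["x"] = x
beta_struct = ols(y, rec)
```

All three calls should recover the intercept 1 and slope 2 used to
generate y. The core stays simple and fast to test, and the
array-type handling lives in one place.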

After the changes to the current statistical functions, I was
considering areas of statistics that have only partial coverage.
Non-parametric tests are well represented, and I have some
extensions for tests for discrete distributions. I think ANOVA, which I
have never used myself, has a very incomplete collection. I guess this
is a historical accident, since Gary Strangman had, I think, more ANOVA
functions that are not included in stats.
So instead of having a laundry list of functions (some of which don't
seem to have been used for years), I would prefer at least a
conceptual grouping around statistical topics. Regression, of course,
is currently MIA.

The next large interface issue, especially for enhancements, is
whether to use functions or proper classes. I think for some
statistical analysis the current statistical functions, once cleaned
up, work fine. However, even R returns result classes (or whatever
their equivalent is) for every statistical test, while in Python we
use MATLAB-style functions.
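
A middle ground between the two styles can be sketched with a named
tuple: callers that unpack a plain tuple keep working, while the result
also has named attributes. TTestResult and ttest_1samp_sketch are
hypothetical names, and only the statistic and degrees of freedom are
computed here (no p-value).

```python
from typing import NamedTuple

import numpy as np


class TTestResult(NamedTuple):
    """Hypothetical result object for a one-sample t-test."""
    statistic: float
    df: int


def ttest_1samp_sketch(a, popmean):
    """One-sample t statistic; a sketch, not the scipy.stats function."""
    a = np.asarray(a, dtype=float)
    n = a.size
    # Standard error of the mean, using the unbiased sample variance.
    se = a.std(ddof=1) / np.sqrt(n)
    return TTestResult(statistic=float((a.mean() - popmean) / se), df=n - 1)


res = ttest_1samp_sketch([1.0, 2.0, 3.0, 4.0, 5.0], popmean=2.0)
t, df = res              # function-style tuple unpacking still works
same_t = res.statistic   # class-style attribute access
```

A full class would go further (methods, summaries, links back to the
data), but even this much lets results grow new fields by name rather
than by tuple position.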

This will change when models are included again.

I have a list of functions that have no test coverage and a list (not
written down) of functions with suspected or known bugs, and
it would be useful to get a wider opinion about which functions and
interfaces are important.
Working on the list of functions on the wiki page may be simpler for
collecting comments than going through the statistical review in trac.

Overall, I think there is still a lot of work to do before I start to
worry about white space issues.

Josef