[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

Fri Feb 27 10:03:03 EST 2009

josef.pktd at gmail.com wrote:
> I think a discussion for a roadmap for stats will be very useful.
>
> Currently my priority is still your point
>      7 iii) Ideally there should be tests that check the function accuracy.
>
> I consider this the main point of almost all my work on stats. And there are
> still some incorrect parts left.
>   
Yes, that is why I added it.
> The next part for the current code base, that I think about, was to
> evaluate function
> whether they are ok, can be generalized, e.g. dimension, or are
> trivial and should be
> removed.
>   
I agree, as I do think some are a consequence of the porting process and
have never received the appropriate follow-up since.
> Next are changes in the interface and combining or comparing mstats and stats.
> Here, I don't have a clear opinion yet of how far we can or want to
> consistently generalize all statistical functions to the different
> type of arrays. In many cases I looked at, the masked array version
> looked sufficiently different that I would be reluctant to merge them.
> One radical alternative would be to deprecate stats.stats and expand
> mstats, since it is already better designed to handle different array
> types. But I like the "simple" versions in stats, and I'm curious
> about any speed difference.
>   
This is the main issue that prevents me from going further with this aspect!

I do not find that suggestion radical at all: I am in favor of just
using masked arrays because I do not perceive a speed difference. (Okay,
I am perhaps unusual in that I work with large datasets and complex
models, so differences of a few seconds are not that meaningful to me.)
It would be less work to convert the missing functions, as there are
about 85 functions missing from the masked version.
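
To make the masked-array point concrete, here is a minimal sketch of how a
missing observation is handled by a plain ndarray versus a masked array. The
sentinel value -999 and the variable names are just assumptions for
illustration; this says nothing about relative speed.

```python
import numpy as np
import numpy.ma as ma

# Data with one missing observation, flagged by an illustrative sentinel (-999).
raw = np.array([2.0, 4.0, -999.0, 6.0])

# Mask the sentinel so that reductions skip it.
masked = ma.masked_values(raw, -999.0)

plain_mean = raw.mean()      # the sentinel is included in the mean
masked_mean = masked.mean()  # the masked entry is ignored: (2 + 4 + 6) / 3
n_valid = masked.count()     # number of unmasked observations
```

The plain version silently uses the sentinel, which is the kind of
difference that makes me lean towards masked arrays by default.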

> But general tools to interface to different array types would be
> useful and should be carefully designed, e.g. function like ols that
> have a plain ndarray core, but can access the data from structured
> arrays and masked arrays.
>
> After, the changes to the current statistical function, I was
> considering areas of statistics that have partial but incomplete
> coverage. Non-parametric tests are well represented, and I have some
> extension for tests for discrete distributions. I think ANOVA, which I
> never used myself, has a very incomplete collection, which, I guess is
> a historical accident since Gary Strangman had, I think more ANOVA
> functions that are not included in stats.
> So instead of having a laundry list of functions, (some of which don't
> seem to have been used for years), I would prefer at least a
> conceptual grouping around statistical topics. Regression of course
> is currently MIA.
>
>   
Even after Robert's reply on that, stats.py at least still has 
linregress (simple regression with one variable) and glm that address 
these. However, there is a strong case that both of these should also be 
removed in favor of a better approach.

I agree that things like general linear models (e.g. regression and
ANOVA assuming normality), generalized linear models and such need a
careful design that integrates existing solutions where possible. Even
SAS has different procedures, and different modules are available for R,
to do these. But that must be a separate discussion.
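
As a rough illustration of the "plain ndarray core" idea quoted above, one
possible shape is a core that only sees clean ndarrays plus a thin wrapper
that strips masked observations first. The names ols_core and ols are
hypothetical, not existing scipy functions, and dropping masked rows is only
one of several possible policies.

```python
import numpy as np
import numpy.ma as ma

def ols_core(y, X):
    """Least-squares coefficients for plain ndarrays (no missing values)."""
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def ols(y, x):
    """Hypothetical wrapper: drop masked observations, then call the core."""
    y = ma.asarray(y, dtype=float)
    x = ma.asarray(x, dtype=float)
    # Keep only rows where neither y nor x is masked.
    keep = ~(ma.getmaskarray(y) | ma.getmaskarray(x))
    yk = ma.getdata(y)[keep]
    xk = ma.getdata(x)[keep]
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones_like(xk), xk])
    return ols_core(yk, X)

# y = 1 + 2*x everywhere except at the masked point, which is ignored.
x = ma.array([0.0, 1.0, 2.0, 3.0], mask=[False, False, True, False])
y = np.array([1.0, 3.0, 100.0, 7.0])
intercept, slope = ols(y, x)
```

The same wrapper could in principle pull columns out of a structured array
before calling the core, but that is part of the design discussion.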

> The next large interface issue, especially for enhancements, is
> whether to use functions or proper classes. I think for some
> statistical analysis the current statistical function, once cleaned
> up, work fine. However, even R returns result classes (or whatever
> their equivalent is) for every statistical test, while in python we
> use matlab style functions.
>
> This will change when models will be included again.
>   
Excellent!
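
On functions versus result classes, a lightweight middle ground might be a
function that returns a named result. The sketch below is only an
assumption about what that could look like: TTestResult and
ttest_1samp_sketch are made-up names, and the p-value is left out on
purpose.

```python
from collections import namedtuple
import numpy as np

# Hypothetical result type; the name and fields are illustrative only.
TTestResult = namedtuple("TTestResult", ["statistic", "df", "mean", "stderr"])

def ttest_1samp_sketch(a, popmean):
    """One-sample t statistic returned as a named result (no p-value)."""
    a = np.asarray(a, dtype=float)
    n = a.size
    mean = a.mean()
    stderr = a.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    return TTestResult((mean - popmean) / stderr, n - 1, mean, stderr)

res = ttest_1samp_sketch([1.0, 2.0, 3.0, 4.0, 5.0], popmean=2.0)
t, df = res.statistic, res.df   # access by name
t2, df2, _, _ = res             # plain tuple unpacking still works
```

Existing code that unpacks a tuple keeps working, while new code can use
the field names.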
> I have a list of functions that have no test coverage, a list (not
> written down) of functions that have bug suspects or known bugs, and
> it would be useful to get a wider opinion about which functions and
> interfaces are important
> Working on the list of functions on the wiki page maybe simpler for
> collecting comments than going through the statistical review in trac.
>   
I agree that we need to address what functions we really need and what 
interface is required. From that we can address the required tests and 
documentation.

> Overall, I think there is still a lot of work to do before I start to
> worry about white space issues.
>   
Yeah, I just figured that we should correct any of these coding style
issues along the way.

Thanks for all the comments,
Bruce