[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

Fri Feb 27 11:19:24 EST 2009

On Fri, Feb 27, 2009 at 10:03 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> josef.pktd at gmail.com wrote:
>> I think a discussion for a roadmap for stats will be very useful.
>>
>> Currently my priority is still your point
>>      7 iii) Ideally there should be tests that check the function accuracy.
>>
>> I consider this the main point of almost all my work on stats. And there are
>> still some incorrect parts left.
>>
> Yes, that is why I added it.
>> The next step I had in mind for the current code base was to evaluate
>> the functions: whether they are ok, can be generalized (e.g. to more
>> dimensions), or are trivial and should be removed.
>>
> I agree as I do think some are a consequence of the porting process and
> have never received the appropriate followup over time.
>> Next are changes in the interface and combining or comparing mstats and stats.
>> Here, I don't have a clear opinion yet of how far we can or want to
>> consistently generalize all statistical functions to the different
>> type of arrays. In many cases I looked at, the masked array version
>> looked sufficiently different that I would be reluctant to merge them.
>> One radical alternative would be to deprecate stats.stats and expand
>> mstats, since it is already better designed to handle different array
>> types. But I like the "simple" versions in stats, and I'm curious
>> about any speed difference.
>>
> The main issue that prevents me from going further with this aspect!
>
> I do not find that suggestion radical at all: I am in favor of just
> using masked arrays, because I do not perceive a speed difference. (Okay,
> I am perhaps unusual in that I work with large datasets and complex
> models, so differences of a few seconds are not that meaningful to me.)
> It would be less work to convert the missing functions, as about 85
> functions are missing from the masked version.
>

I don't know what the current range of use cases for stats is. But, for example,
in matlab I have some ols estimation in an inner loop where I wouldn't want much
overhead. In that case, though, it would always be possible to go back to
raw linalg.lstsq.
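
A minimal sketch of that fallback, assuming NumPy's np.linalg.lstsq. The
helper name ols_beta and the data are made up for illustration:

```python
import numpy as np

def ols_beta(y, X):
    """Return OLS coefficients for y = X @ beta + error (hypothetical helper)."""
    # lstsq returns (solution, residuals, rank, singular values); only the
    # solution is needed inside a tight loop.
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
# Design matrix: a constant plus two made-up regressors.
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta  # noise-free, so the estimate should recover true_beta
beta_hat = ols_beta(y, X)
```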

The other disadvantage for me is that it is much easier to write functions
that work for plain arrays, since I'm not working with masked/missing data.
It's ok if the handling of different array types can be done in the
interface of the function, but translating some statistical formulas into
code or porting them from another language will be more difficult for me if
I have to worry about missing values all the time. An example that I looked
at recently is statistical analysis of panel data: with a balanced panel the
linear algebra and matrix operations are much easier than with an
unbalanced panel.
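
A rough sketch of what "handling in the interface" could look like: the core
only ever sees plain ndarrays, and a thin wrapper deals with masked input.
The names _ols_core and ols are hypothetical, and dropping every row that
has a masked entry is just one possible policy, assumed here for simplicity:

```python
import numpy as np

def _ols_core(y, X):
    """Plain-ndarray core: no missing-value handling at all."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def ols(y, X):
    """Hypothetical interface: drop rows with masked entries, then call the core."""
    y = np.ma.asarray(y)
    X = np.ma.asarray(X)
    # A row is usable only if neither y nor any column of X is masked there.
    keep = ~(np.ma.getmaskarray(y) | np.ma.getmaskarray(X).any(axis=1))
    return _ols_core(y.data[keep], X.data[keep])

X = np.column_stack([np.ones(6), np.arange(6.0)])
y = 3.0 + 2.0 * np.arange(6.0)

# Corrupt one observation and mask it; the wrapper should ignore it.
y_bad = y.copy()
y_bad[2] = 999.0
y_masked = np.ma.masked_array(y_bad, mask=[0, 0, 1, 0, 0, 0])

beta_plain = ols(y, X)          # plain ndarray input
beta_masked = ols(y_masked, X)  # masked input, same interface
```

Both calls go through the same core, so the statistical formula is written
only once and only for plain arrays.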

What I would like to do, but haven't had the time for yet, is to run the
tests for stats.stats on stats.mstats. This way, even if we have some
duplicate functions, we would have a cross-check that they are consistent,
and it would be a reminder to fix bugs in the other version as well.
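
A sketch of such a cross-check, assuming the listed functions exist under
the same name and with compatible defaults in both scipy.stats and
scipy.stats.mstats (the list itself is only an illustrative selection):

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

rng = np.random.default_rng(12345)
x = rng.normal(size=200)  # no missing values, so both versions should agree

# Illustrative selection of functions that exist in both modules.
names = ["skew", "kurtosis", "sem"]

mismatches = []
for name in names:
    plain = float(getattr(stats, name)(x))
    masked = float(getattr(mstats, name)(np.ma.asarray(x)))
    if not np.isclose(plain, masked):
        mismatches.append(name)
```

Any name that ends up in mismatches would point to a bug or an undocumented
difference in defaults in one of the two versions.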

>
>> But general tools to interface to different array types would be
>> useful and should be carefully designed, e.g. function like ols that
>> have a plain ndarray core, but can access the data from structured
>> arrays and masked arrays.
>>
>> After the changes to the current statistical functions, I was
>> considering areas of statistics that have partial but incomplete
>> coverage. Non-parametric tests are well represented, and I have some
>> extensions for tests for discrete distributions. I think ANOVA, which I
>> never used myself, has a very incomplete collection, which I guess is
>> a historical accident, since Gary Strangman had, I think, more ANOVA
>> functions that are not included in stats.
>> So instead of having a laundry list of functions (some of which don't
>> seem to have been used for years), I would prefer at least a
>> conceptual grouping around statistical topics. Regression of course
>> is currently MIA.
>>
>>
> Even after Robert's reply on that, stats.py at least still has
> linregress (simple regression with one variable) and glm that address
> these. However, there is a strong case that both of these should also be
> removed in favor of a better approach.
>

I don't really count linregress as a "serious" statistical function, since
the restriction to one explanatory variable has no computational
advantage if we have access to linalg.
Similarly, I don't know what the purpose of pointbiserialr is, if you can use
np.corrcoef for the correlation coefficient or stats.pearsonr for the p-values.
My impression is that these are historical functions, when there was no
easy access to fast computers and full matrix and array packages.
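
To make the comparison concrete, here is a small sketch on made-up data. It
assumes the SciPy names stats.linregress, stats.pointbiserialr and
stats.pearsonr, and that their results can be indexed like tuples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 + 1.5 * x + rng.normal(size=100)

# Simple regression via linregress ...
res = stats.linregress(x, y)
# ... and the same fit via least squares on the design matrix [1, x].
X = np.column_stack([np.ones_like(x), x])
(intercept, slope), _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# The point-biserial correlation is the Pearson correlation with a 0/1 variable.
group = (x > 0).astype(float)
r_pb = stats.pointbiserialr(group, y)[0]
r_pearson = stats.pearsonr(group, y)[0]
r_corrcoef = np.corrcoef(group, y)[0, 1]
```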

stats.glm is a bit of a misnomer: it is just a t-test for the regression on
one dummy variable, not an estimator. But again I don't see an advantage
compared to ols with multivariate regressors and dummy variables.
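
As an illustration of why ols covers this case (a sketch on made-up data,
not a reimplementation of stats.glm): the t-statistic of the dummy
coefficient in an ols regression equals the pooled-variance two-sample
t-statistic.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(loc=0.0, size=30)  # group with dummy = 0
b = rng.normal(loc=0.7, size=40)  # group with dummy = 1

# OLS of y on a constant and a 0/1 dummy.
y = np.concatenate([a, b])
d = np.concatenate([np.zeros(a.size), np.ones(b.size)])
X = np.column_stack([np.ones_like(d), d])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (y.size - X.shape[1])  # residual variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)      # classical OLS covariance
t_ols = beta[1] / np.sqrt(cov_beta[1, 1])

# Pooled-variance two-sample t statistic computed directly.
pooled = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / (
    a.size + b.size - 2
)
t_direct = (b.mean() - a.mean()) / np.sqrt(pooled * (1 / a.size + 1 / b.size))
```

With more dummy columns or additional regressors in X, the same few lines
carry over unchanged, which the single-dummy special case cannot offer.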

> I agree that doing things like general linear models (eg regression and
> ANOVA assuming normality), generalized linear models and such need a
> careful design that integrates where possible existing solutions. Even
> SAS has different procedures and different modules are available for R
> to do these. But that must be a separate discussion.
>
>> The next large interface issue, especially for enhancements, is
>> whether to use functions or proper classes. I think for some
>> statistical analysis the current statistical functions, once cleaned
>> up, work fine. However, even R returns result classes (or whatever
>> their equivalent is) for every statistical test, while in python we
>> use matlab style functions.
>>
>> This will change when models will be included again.

There are still bugs in it, and test coverage is still low. If anyone
wants to help with the review, bug hunting, or adding tests, the
current version is in nipy at
https://code.launchpad.net/~nipy-developers/nipy/trunk-josef-models

>>
> Excellent!
>> I have a list of functions that have no test coverage, a list (not
>> written down) of functions that have suspected or known bugs, and
>> it would be useful to get a wider opinion about which functions and
>> interfaces are important.
>> Working on the list of functions on the wiki page may be simpler for
>> collecting comments than going through the statistical review in trac.
>>
> I agree that we need to address what functions we really need and what
> interface is required. From that we can address the required tests and
> documentation.
>
>> Overall, I think there is still a lot of work to do before I start to
>> worry about white space issues.
>>
> Yeah, I just figured that we should correct any of these coding style
> issues on the way.

I'm slowly getting used to the formatting requirements, and at least during
code changes, I try to stick to them.

>
> Thanks for all the comments,
> Bruce
>

Josef