[SciPy-dev] Homogenizing stats & mstats

Fri Jul 24 16:12:49 EDT 2009

On 07/24/2009 01:23 PM, Pierre GM wrote:
> On Jul 24, 2009, at 11:14 AM, Bruce Southey wrote:
>    
>> As I now think about these functions, the stats functions do need to
>> be split into at least two parts such as descriptive stats like
>> geometric mean (gmean) and statistical test functions like
>> kendalltau.  Perhaps even adding a set of utility functions like
>> tmax, tmean and tmin (but these are limited to one dimensional
>> arrays).
>>      
>
> That was my intention as well: to split stats into 2 or 3 files.
> Descriptive stats (means & quantiles) on one side, tests on the other
> sounds good. Should we start creating these files already, side by side
> with the current stats/mstats files? Should we create a branch?
>    
Just go ahead and do what you want! :-)
The real issue is whether or not the stats files will be replaced by the 
new versions or be a new entity (that could then replace the old 
versions). Initially it would be good to keep the old versions around to 
check and test functionality.

>    
>> We also need to address ticket 604:'Statistics functions with new
>> options' at the same time.
>> http://projects.scipy.org/scipy/ticket/604
>>      
>
> Indeed.
>
>    
>>> * A second step would be to use numpy.ma under the hood, returning
>>> either a MaskedArray if the input is a MaskedArray itself, or just a
>>> standard ndarray otherwise.
>>>        
>> Really I think that the input object must be preserved unless the
>> user states otherwise. One aspect is that masked arrays
>> automatically mask any nonfinite elements like infinity. For
>> certain stats it is essential to know that this has occurred as it
>> signals a larger problem, but automatically masking hides this
>> problem. For example:
>> c = np.ma.masked_array([1., 2., 3., np.nan], mask=[1, 0, 0, 0])
>> # provides a masked array with NaN
>> c/2
>> # automatically masks the np.nan, which is fine if you know, but
>> # not if you do not want nonfinite values masked.
>>      
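A minimal sketch of the behaviour described above, assuming a current NumPy. It contrasts a plain ndarray with a masked array under division. The names `plain`, `halved` and `newly_masked` are illustrative only, and the final check is a hypothetical way a stats routine could notice the silent masking.

```python
import numpy as np

values = [1.0, 2.0, 3.0, np.nan]

# Plain ndarray: the NaN propagates and stays visible in the result.
plain = np.array(values) / 2

# Masked array: the first element is masked on input, the NaN is not.
c = np.ma.masked_array(values, mask=[1, 0, 0, 0])
halved = c / 2

# Masked division also masks nonfinite results, so the NaN is now hidden.
mask_before = np.ma.getmaskarray(c)
mask_after = np.ma.getmaskarray(halved)

# Entries that became masked during the operation (illustrative check).
newly_masked = mask_after & ~mask_before
print(newly_masked)
```

Comparing the mask before and after is one way to report that nonfinite values were present instead of hiding them.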
>
> OK, I see the problem here. We could have this usemask tell us whether
> to use the MA behaviour (invalid outputs are masked, a MA is output no
> matter the type of the input) or not (NaN/Infs are preserved, a
> standard ndarray is output no matter the type of the input).
> Nevertheless, some of the functions (ranking, tests with ties) work
> correctly in mstats and not in stats (compared to R): we could use the
> mstats implementation instead of the stats one, then.
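One possible reading of the usemask idea, as a rough sketch only. The function name `gmean_sketch` is hypothetical, and the way masked input is handled when `usemask=False` (masked entries are filled with NaN) is an assumption, not something settled in this thread.

```python
import numpy as np

def gmean_sketch(a, axis=0, usemask=False):
    """Geometric mean with a hypothetical `usemask` switch."""
    if usemask:
        # MA behaviour: invalid values (NaN, log of 0 or negatives) are
        # masked, and a MaskedArray is returned whatever the input type.
        am = np.ma.asarray(a, dtype=float)
        return np.ma.asarray(np.ma.exp(np.ma.log(am).mean(axis=axis)))
    # ndarray behaviour: NaN/Inf are preserved and a plain ndarray is
    # returned. Assumption: masked input entries are turned into NaN.
    data = np.ma.filled(np.ma.asarray(a, dtype=float), np.nan)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.asarray(np.exp(np.log(data).mean(axis=axis)))

x = [1.0, 4.0, np.nan]
masked_result = gmean_sketch(x, usemask=True)   # NaN is masked out
plain_result = gmean_sketch(x, usemask=False)   # NaN propagates
```

With `usemask=True` the NaN is ignored and the result is computed from the two valid values. With `usemask=False` the NaN reaches the output, so the caller sees that something was wrong with the input.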
>
>
>
>    
>> It would be great to have at least the Matrix class work (record/
>> structured arrays and even sparse arrays as well) but I do not know
>> enough about these to know how.
>>      
>
> Not too much of a problem for descriptive stats on Matrix if we use
> np.asanyarray. Structured arrays are a different beast, as the
> standard functions (+-/*...) don't work (for a good reason, and this
> may change later on). I've no experience on sparse arrays, so count me
> out on this one.
>    
Sounds like Matrix should be sufficiently easy to incorporate and we 
leave the rest on the wish list.
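A small sketch of why np.asanyarray helps here: unlike np.asarray, it passes ndarray subclasses such as matrix or MaskedArray through unchanged, so the result keeps the input type. The helper name `col_means` is made up for illustration.

```python
import numpy as np

def col_means(a):
    # asanyarray keeps ndarray subclasses (matrix, MaskedArray) intact,
    # whereas asarray would convert them to a base ndarray.
    a = np.asanyarray(a)
    return a.mean(axis=0)

m = np.matrix([[1.0, 2.0], [3.0, 4.0]])
matrix_out = col_means(m)          # still a matrix, shape (1, 2)

x = np.ma.array([[1.0, 2.0], [3.0, 4.0]], mask=[[0, 0], [1, 0]])
masked_out = col_means(x)          # mask is honoured in the mean
dropped = np.asarray(x).mean(axis=0)  # mask silently ignored
```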

>
>
>    
>>> * A third would be to port the remaining routines of mstats.extras to
>>> stats or morestats (Harrell-Davis quantiles could be implemented more
>>> efficiently in cython, for example).
>>>
>>> At each step, we could add a deprecation warning to a reviewed mstats
>>> function and call the corresponding stats function instead.
>>>
>>>        
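The deprecation step could look roughly like the following sketch. The helper `deprecated_alias` and the stand-in `gmean` are hypothetical and are not existing scipy code; the only real API used is the standard `warnings` module.

```python
import math
import warnings

def gmean(values):
    """Stand-in for a reviewed stats function (illustrative only)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def deprecated_alias(new_func, old_name):
    """Build a wrapper that warns, then forwards to `new_func`."""
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller
        )
        return new_func(*args, **kwargs)
    wrapper.__name__ = old_name
    return wrapper

# The old mstats entry point becomes a thin, warning wrapper.
mstats_gmean = deprecated_alias(gmean, "mstats.gmean")
```

Callers keep working while the warning tells them which function to switch to.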
>> Unfortunately there is not a one to one matching between the stats
>> and mstats functions.
>>      
>
> Mmh, if we proceed methodically, that shouldn't be too much of a
> problem. Name differences can be easily addressed. Behavior differences
> are trickier, but may be just bugs waiting for us.
>
>
>    
>> When I started I found 178 functions between the different modules
>> including some that are or should be deprecated. Only about 40
>> functions (plus a few that should be removed) have the same
>> name in the stats and mstats_basic files. I have not checked these
>> to know if these have the exact same behavior as expected by the
>> input type. There are others that perhaps only differ in name.
>>      
>
>    
>>> What would be a good time line ? 0.8.0, or is it too late? 0.9.0 ?
>>>        
>> For 0.8 I think we must at least warn users that changes are coming for
>> stats and mstats, as well as make sure that any unnecessary
>> functions are deprecated. Also we could start the process to
>> reorganize the stats functions and combine the stats and mstats
>> functions with the same name and behavior.
>>      
>
> When is 0.8.0 supposed to be released ? If it's a matter of just a
> couple of weeks, we can sit on the issue as long as needed. If it's
> longer than that, we should probably get started now.
>
>    

While I cannot help with this immediately, I had submitted patches for 
some of these. So hopefully the following will help.

These functions just rename existing functions and perhaps the renaming, 
as necessary, should be elsewhere (like the distributions):
chisqprob
erfc
fprob
ksprob
zprob

These functions are/should be deprecated:
mean
median
std
var
samplestd
samplevar

I thought that these could be replaced by a one-liner using the compress 
method because these only work for 1d arrays; i.e., for some cutoff values 
minval and maxval:
tmean    a.compress((a>minval) & (a<maxval)).mean()
tmin    a.compress((a>minval) & (a<maxval)).min()
tsem    a.compress((a>minval) & (a<maxval)).std() with df=n
tstd    a.compress((a>minval) & (a<maxval)).std() with df=n-1
tvar    a.compress((a>minval) & (a<maxval)).var()

Actually these probably should be deprecated in favor of the mstats 
approach for trimmed_mean etc that have an axis keyword indicating the 
support for multiple dimensions.
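As a rough sketch of the two approaches: the compress one-liner for 1d input, next to a masked-array version that accepts an axis keyword. Both helper names are made up for illustration, and the strict inequalities follow the one-liners above, not any particular scipy default.

```python
import numpy as np

def tmean_1d(a, minval, maxval):
    # The compress one-liner: keep values strictly inside the limits.
    a = np.asarray(a, dtype=float).ravel()
    return a.compress((a > minval) & (a < maxval)).mean()

def tmean_axis(a, minval, maxval, axis=None):
    # Masked-array variant: mask values outside the limits, then let
    # the masked mean handle any axis.
    a = np.asarray(a, dtype=float)
    outside = ~((a > minval) & (a < maxval))
    return np.ma.masked_where(outside, a).mean(axis=axis)

data = np.array([[1.0, 2.0, 50.0],
                 [3.0, 4.0, 5.0]])
flat_mean = tmean_1d(data, 0, 10)           # 50 is trimmed away
row_means = tmean_axis(data, 0, 10, axis=1)  # one trimmed mean per row
```

The masked version gives the same answer on flattened input and also works along rows or columns without reshaping.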

Below is a list I compiled of the different functions that have the 
same name in both stats and mstats (really mstats_basic). For the most 
part these have the same arguments but not always.  Also some are or 
should be deprecated or are unnecessary.
_chk_asarray
_chk2_asarray
betai
chisquare
describe
f_oneway
f_value_wilks_lambda
friedmanchisquare
gmean
hmean
kendalltau
kurtosis
kurtosistest
linregress
mannwhitneyu
mode
moment
normaltest
obrientransform
pearsonr
pointbiserialr
rankdata
samplestd
samplevar
scoreatpercentile
sem
signaltonoise
skew
skewtest
spearmanr
std
stderr
threshold
tmax
tmean
tmin
trimboth
tsem
ttest_ind
ttest_rel
tvar
var
variation
z
zmap
zs
find_repeats
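A list like this could be regenerated with a few lines of introspection. The helper below is a hypothetical sketch. It is demonstrated on two standard-library modules so that it runs anywhere; with scipy installed, one would presumably pass the stats and mstats modules instead. It only compares names, not arguments or behavior.

```python
import cmath
import math

def shared_names(mod_a, mod_b, include_private=False):
    """Return the sorted names defined in both modules."""
    def names(mod):
        return {
            n for n in dir(mod)
            if not n.startswith("__")
            and (include_private or not n.startswith("_"))
        }
    return sorted(names(mod_a) & names(mod_b))

# Stand-in modules for the demonstration.
common = shared_names(math, cmath)
print(common)
```

Setting `include_private=True` would also pick up single-underscore helpers such as the `_chk_asarray` entries in the list above.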

Bruce