[SciPy-dev] Homogenizing stats & mstats

Bruce Southey bsouthey at gmail.com
Fri Jul 24 11:14:18 EDT 2009


On 07/24/2009 01:15 AM, Pierre GM wrote:
> All,
> I was browsing some recent tickets for scipy.stats, and couldn't help
> but notice that a significant number of them (#845, #822, #901...) are
> related to some lack of consistency between stats and mstats.
>
> I'd like to eventually get rid of mstats altogether, provided the
> same functionality is supported in stats.
>    
Yeah, that would be great but I ran out of steam to do more and have not 
found the time to go back.


> * A first step would be to use np.asanyarray instead of np.asarray.
> That should be sufficient for functions like gmean and hmean for
> example.
>    
Well there should be a couple of patches for those two.
http://projects.scipy.org/scipy/ticket/907
http://projects.scipy.org/scipy/ticket/908
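
For reference, the difference that this first step relies on can be
sketched as below. This is only an illustration and not the code from
the patches in those tickets; gmean_sketch is a hypothetical name and
the sketch assumes strictly positive input.

```python
import numpy as np

def gmean_sketch(a, axis=0):
    """Geometric mean that keeps array subclasses (hypothetical helper)."""
    # asanyarray passes ndarray subclasses such as MaskedArray through
    # unchanged, whereas asarray would reduce them to a plain ndarray.
    a = np.asanyarray(a)
    return np.exp(np.log(a).mean(axis=axis))

x = np.ma.masked_array([1.0, 2.0, 4.0, 100.0], mask=[0, 0, 0, 1])
print(type(np.asarray(x)))     # plain ndarray: the mask is lost
print(type(np.asanyarray(x)))  # MaskedArray: the mask is kept
print(gmean_sketch(x))         # geometric mean of the unmasked 1, 2, 4
```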

It was not clear whether some functions should be in scipy, or even in 
stats, at least in their current form (this made me stop what I was 
doing). I really hope that Numpy will eventually provide something like 
nanmean and nanstd. Some functions appeared to be limited to specific 
array dimensions (trimboth), others appear to be one-liners, and some 
have different names but may be the same function (trim_mean and 
trimmed_mean).
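
A NaN-aware mean and standard deviation of the kind wished for here
could be sketched on top of numpy.ma. The names nanmean_sketch and
nanstd_sketch are hypothetical, and note the assumption that infinities
are ignored along with NaN, because masked_invalid masks both.

```python
import numpy as np

def nanmean_sketch(a, axis=None):
    """Mean that skips NaN (and inf) entries; hypothetical helper."""
    # masked_invalid masks every NaN and infinite entry, so the
    # reduction only sees the finite values.
    return np.ma.masked_invalid(a).mean(axis=axis)

def nanstd_sketch(a, axis=None, ddof=0):
    """Standard deviation that skips NaN (and inf) entries."""
    return np.ma.masked_invalid(a).std(axis=axis, ddof=ddof)

data = np.array([1.0, 2.0, np.nan, 3.0])
print(nanmean_sketch(data))  # mean of 1, 2, 3
print(nanstd_sketch(data))   # population std of 1, 2, 3
```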

As I now think about these functions, the stats functions do need to be 
split into at least two parts: descriptive statistics such as the 
geometric mean (gmean), and statistical test functions such as 
kendalltau. Perhaps we could even add a set of utility functions like 
tmax, tmean and tmin (but these are limited to one-dimensional arrays).

We also need to address ticket 604, 'Statistics functions with new 
options', at the same time.
http://projects.scipy.org/scipy/ticket/604

> * A second step would be to use numpy.ma under the hood, returning
> either a MaskedArray if the input is a MaskedArray itself, or just a
> standard ndarray otherwise. That should take care of the functions
> related to ranking and tie handling (I'm pretty confident in the
> mstats routines, and we can always double-check the results w/ R). If
> needed, we could also add a usemask flag, like we do in
> np.io.genfromtxt.
>    
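
The second step quoted above might be sketched roughly as follows. This
is an assumption about how such a dispatch could look, not existing
stats or mstats code: mean_sketch and the exact meaning given to the
usemask flag are hypothetical.

```python
import numpy as np

def mean_sketch(a, axis=None, usemask=None):
    """Mean computed with numpy.ma; output type follows the input.

    Hypothetical helper: usemask=None means "return a MaskedArray only
    if the input was a MaskedArray".
    """
    if usemask is None:
        usemask = isinstance(a, np.ma.MaskedArray)
    # Do the work with numpy.ma under the hood in every case.
    result = np.ma.asanyarray(a).mean(axis=axis)
    if usemask:
        return np.ma.asanyarray(result)
    # Plain output: any fully masked result is filled with NaN.
    return np.ma.filled(result, np.nan)

plain = np.array([[1.0, 2.0], [3.0, 4.0]])
masked = np.ma.masked_array(plain, mask=[[0, 0], [1, 0]])
print(type(mean_sketch(plain, axis=0)))   # ndarray in, ndarray out
print(type(mean_sketch(masked, axis=0)))  # MaskedArray in, MaskedArray out
```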

Really, I think that the input object must be preserved unless the user 
states otherwise. One aspect is that masked array operations can 
automatically mask nonfinite elements such as infinity or NaN. For 
certain stats it is essential to know that this has occurred, as it 
signals a larger problem, but automatic masking hides that problem. For 
example:
# provides a masked array with NaN
c = np.ma.masked_array([1., 2., 3., np.nan], mask=[1, 0, 0, 0])
# automatically masks the np.nan, which is fine if you know but not
# if you do not want nonfinite values masked
c / 2
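
A runnable version of this example, together with one way a caller
could notice the nonfinite value before it is silently masked, might
look like the sketch below. The np.isfinite check on the underlying
data is an assumption about how to detect the problem, not something
stats or mstats does.

```python
import numpy as np

c = np.ma.masked_array([1.0, 2.0, 3.0, np.nan], mask=[1, 0, 0, 0])
print(c.mask)  # only the first element is masked

d = c / 2
print(d.mask)  # the NaN result is now masked as well

# Count nonfinite values that are not masked yet, i.e. the ones that
# a later operation could hide.
hidden = np.count_nonzero(~np.isfinite(c.data) & ~np.ma.getmaskarray(c))
print(hidden)
```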

It would be great to have at least the Matrix class work 
(record/structured arrays and even sparse arrays as well) but I do not 
know enough about these to know how.


> * A third would be to port the remaining routines of mstats.extras to
> stats or morestats (Harrell-Davies quantiles could be implemented more
> efficiently in cython, for example).
>
> At each step, we could add a Deprecate warning to a reviewed mstat
> function and call the corresponding stat function instead.
>    
Unfortunately there is not a one-to-one matching between the stats and 
mstats functions. When I started I found 178 functions across the 
different modules, including some that are or should be deprecated. Only 
about 40 functions (plus a few that should be removed) have the same 
name in the stats and masked_basic files. I have not checked whether 
these have exactly the same behavior, as expected for the input type. 
There are others that perhaps differ only in name.
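
Where a name and its behavior do match, the deprecation step suggested
above could be sketched as follows. deprecated_alias is a hypothetical
helper, and the pure-Python gmean is only a stand-in for a reviewed
stats function.

```python
import functools
import math
import warnings

def deprecated_alias(new_func, old_name):
    """Return a wrapper that warns, then forwards to new_func."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return new_func(*args, **kwargs)
    return wrapper

def gmean(values):
    """Stand-in for the reviewed stats function (positive input only)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# The old mstats name keeps working but now emits a warning.
mstats_gmean = deprecated_alias(gmean, "mstats.gmean")
```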

> What would be a good time line ? 0.8.0, or is it too late? 0.9.0 ?
>    
For 0.8 I think we must at least warn users that changes are coming for 
stats and mstats, as well as make sure that any unnecessary functions 
are deprecated. Also we could start the process of reorganizing the 
stats functions and combining the stats and mstats functions that have 
the same name and behavior.

> Comments expected.
>    
Always! :-)
> Thx in advance
> P.
>    

Bruce
