[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

Fri Feb 27 16:52:11 EST 2009

>
> Indeed. But once again, the masked versions of the function are more
> for convenience. If you need performance, you have to preprocess the
> inputs by transforming them into standard ndarrays one way or another.
> In the case of correlation functions, for example, you can suppress
> missing values pair-wise (that is, drop the entries of x if the
> corresponding entries of y are masked, and vice-versa).
> For basic linear fit, that might be an approach. A second would be to
> work by intervals, the limits of the intervals being a masked value.
> For more complex fitting (eg, loess), problems arise. You can bypass
> them temporarily by raisong a NotImplementedError if the inputs are
> masked, it'd be up to the user to find a way to fill the inputs.

For most of the current statistical functions, with the exception of
different tie handling, I think that we can expand the _chk_asarray to
do the necessary preprocessing.

I also thought that the return for these functions should be easy,
since most of them return statistics and not data arrays. However,
looking at some examples, it is not obvious to me which return result
and type you would like to have. A good example is `moment`, here is
some of the current returns. I think they cover the main return
patterns.

>>> x
array([  0.,   1.,  NaN,   2.])

masked arrays with masked nan:
------------------------------------------------
do you need a masked array as return type, since all values are valid?
how about for t-statistic and p-values? Do p-values need to be masked arrays?

>>> stats.mstats.moment(np.ma.fix_invalid(np.ma.column_stack([x,x])),3)
masked_array(data = [0.0 0.0],
             mask = [False False],
       fill_value = 1e+020)

>>> stats.mstats.moment(np.ma.fix_invalid(np.ma.column_stack([x,x])),2)
masked_array(data = [0.666666666667 0.666666666667],
             mask = [False False],
       fill_value = 1e+020)

>>> stats.mstats.moment(np.ma.fix_invalid(np.ma.column_stack([x,x])),1) #inconsistent return type
array([ 0.,  0.])

masked array without masked values
-----------------------------------------------------
same as above about return type

>>> stats.mstats.moment(np.ma.column_stack([np.arange(4),np.arange(4)]),2)
masked_array(data = [ 1.25  1.25],
             mask = False,
       fill_value = 1e+020)

masked array with nan that is not masked
-------------------------------------------------------------
masked array in, masked array out, nan results converted to mask
is this desired?

>>> stats.mstats.moment(np.ma.column_stack([x,x]),3)
masked_array(data = [-- --],
             mask = [ True  True],
       fill_value = 1e+020)

>>> stats.mstats.moment(np.ma.column_stack([x,x]),0)
masked_array(data = [-- --],
             mask = [ True  True],
       fill_value = 1e+020)

>>> stats.mstats.moment(np.ma.column_stack([x,x]),1)
array([ 0.,  0.])

ndarray with nans
-------------------------
converted to masked array, nans are masked.
here I want to get ndarray with nans returned

>>> stats.mstats.moment(np.column_stack([x,x]),0)
masked_array(data = [-- --],
             mask = [ True  True],
       fill_value = 1e+020)

>>> stats.mstats.moment(np.column_stack([x,x]),1)
array([ 0.,  0.])

ndarray without nans
------------------------------
this should return ndarray

>>> stats.mstats.moment(np.column_stack([np.arange(4),np.arange(4)]),2)
masked_array(data = [ 1.25  1.25],
             mask = False,
       fill_value = 1e+020)

>>> stats.mstats.moment(np.column_stack([np.arange(4),np.arange(4)]),1)
array([ 0.,  0.])

If this return "API" is specified, then it is possible to work out
some examples to see how the merged function works. If you think
converting nans that are the result of calculations to masked arrays
are important, then we could add a keyword argument that implies a
fix_invalid before returning the results.

Josef