[SciPy-Dev] stats.nanstd interface

Sun Jun 20 21:44:01 EDT 2010

On Wed, Jun 16, 2010 at 12:26 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> On Wed, Jun 16, 2010 at 10:17 AM, Bruce Southey <bsouthey at gmail.com> wrote:
>> On 06/16/2010 09:20 AM, josef.pktd at gmail.com wrote:
>>>
>>> On Wed, Jun 16, 2010 at 10:02 AM, Bruce Southey <bsouthey at gmail.com> wrote:
>>>
>>>>
>>>> On 06/16/2010 07:55 AM, Angus McMorland wrote:
>>>>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I've just updated the docstring for scipy.stats.nanstd to the new
>>>>> docstring standard's format. I wonder if, for consistency of
>>>>> interface, we should consider changing it to use a `ddof` parameter,
>>>>> as numpy's std function does, instead of its current `bias` boolean
>>>>> parameter. I'm aware that there are deprecation/API implications
>>>>> associated with this, but I'm not sure what the specifics of those
>>>>> are.
>>>>>
>>>>> Angus.
>>>>>
>>>>>
>>>>
>>>> Please file a ticket for it.
>>>> Can you please list all the differences between the signature of
>>>> numpy's version and this version?
>>>> In particular, the default axis of stats.nanstd is zero compared to None.
>>>> It also lacks the dtype argument.
>>>>
>>>
>>> default axis in scipy.stats is zero not None as in numpy.
>>> np.nansum has no dtype argument, nans can be only in float (I never
>>> checked complex for this), so I don't know whether dtype would be
>>> useful in this case.
>>>
>>
>> From np.std docstring:
>> "
>>    dtype : dtype, optional
>>        Type to use in computing the standard deviation. For arrays of
>>        integer type the default is float64, for arrays of float types it is
>>        the same as the array type.
>> "
>>
>>>
>>>>
>>>> Really, the function needs at least a rewrite unless numpy can provide
>>>> the same functionality.
>>>>
>>>
>>> Can you be more specific? We just rewrote the axis handling.
>>>
>>> I think switching to ddof is a good idea. (FYI: I cannot work on
>>> anything for another two weeks).
>>>
>>> Josef
>>>
>>
>> I know that the broadcasting is not correct in the following but I do not
>> know how to fix it.
>> Also, np.nansum does not accept a dtype argument, so the input needs to
>> be converted to the new precision first.
>>
>> I would like it to handle other array subtypes, or at least to fail
>> explicitly on inputs such as masked arrays, the matrix class, etc.
>>
>> Perhaps something like this works:
>>
>>
>> import numpy as np
>> import scipy.stats as stats
>>
>> def stdnan(x, axis=None, dtype=None, ddof=0):
>>     # Only convert if the requested dtype is more precise than the
>>     # default float64 dtype.
>>     if dtype == np.float128:
>>         x = np.array(x, dtype=dtype)
>>     denom = np.isfinite(x).sum(axis=axis)  # number of finite values
>>     # This is not correct because the broadcasting is wrong for axis > 0.
>>     mean = np.nansum(x, axis=axis) / denom
>>     diff = x - mean  # x minus the mean, which must broadcast correctly
>>     return np.sqrt(np.nansum(diff * diff, axis=axis) / (denom - ddof))
>>
>> a=np.array([[1,2,3], [4, np.nan, 5], [6, 7, np.nan]])
>> print 'stdnan=:', stdnan(a, axis=None), 'stats.nanstd=:', stats.nanstd(a, axis=None, bias=1)
>> print 'stdnan=:', stdnan(a, axis=None, ddof=1), 'stats.nanstd=:', stats.nanstd(a, axis=None, bias=0)
>> print 'stdnan=:', stdnan(a, axis=0), 'stats.nanstd=:', stats.nanstd(a, axis=0, bias=1)
>> print 'stdnan=:', stdnan(a, axis=0, ddof=1), 'stats.nanstd=:', stats.nanstd(a, axis=0, bias=0)
>> print 'The following is wrong because the broadcasting is not correct when computing the difference'
>> print 'stdnan=:', stdnan(a, axis=1), 'stats.nanstd=:', stats.nanstd(a, axis=1, bias=1)
>>
>> Bruce
>>
>
> Thanks Angus for the ticket 1200:
> http://projects.scipy.org/scipy/ticket/1200
>
> I added code to the ticket that I think fixes the broadcasting issue I
> mentioned above and added 'support' for masked array input. Also I
> created the variance function as standard deviation is the square root
> of variance.
>
> I really think that all these stats 'nan functions' could probably
> just convert their input into masked arrays and use the appropriate
> masked array functions, instead of being maintained as separate
> functions. This would also address how to handle the 'out' argument.
>
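
For reference, here is a minimal sketch of one way the broadcasting
problem described above could be avoided: keep the reduced axis (via
keepdims=True) so that the mean lines up with x for any axis. This is
not the code attached to ticket 1200, which is not reproduced here; the
name nanstd_sketch is hypothetical, and the sketch ignores the dtype
and masked-array questions raised in the thread.

```python
import numpy as np

def nanstd_sketch(x, axis=None, ddof=0):
    """NaN-ignoring standard deviation (hypothetical helper, for illustration)."""
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)
    # keepdims=True leaves the reduced axis in place with length 1,
    # so the mean broadcasts against x whatever the axis is.
    count = valid.sum(axis=axis, keepdims=True)
    total = np.where(valid, x, 0.0).sum(axis=axis, keepdims=True)
    mean = total / count
    # NaN entries contribute zero to the sum of squared deviations.
    sq_dev = np.where(valid, (x - mean) ** 2, 0.0)
    var = sq_dev.sum(axis=axis, keepdims=True) / (count - ddof)
    # Drop the length-1 axis again to match the usual reduction shape.
    return np.squeeze(np.sqrt(var), axis=axis)

a = np.array([[1, 2, 3], [4, np.nan, 5], [6, 7, np.nan]])
print(nanstd_sketch(a, axis=1))          # per-row result, the case that failed above
print(nanstd_sketch(a, axis=0, ddof=1))  # ddof=1 corresponds to bias=False
```

In this sketch, ddof=0 plays the role of bias=True and ddof=1 the role
of bias=False. A slice with no valid values (or with count <= ddof)
yields nan or inf along with a NumPy warning rather than an exception.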

Someone can correct me if I'm wrong, but I believe that there is a
performance hit for using masked arrays over the nan functions.  Wes
and Keith have mentioned it wrt pandas and larry, if I recall.
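
A rough way to check this on a given machine is sketched below. It
compares a nansum-based mean with the masked-array equivalent on the
same data and times both. The helper names via_nansum and via_masked
are made up for this example, the array size is arbitrary, and the
timings will depend on the NumPy version and hardware, so no particular
result is claimed here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 100))
x[rng.random(x.shape) < 0.1] = np.nan  # roughly 10% missing values

def via_nansum(x, axis=0):
    # Mean over the non-NaN entries using the nan functions.
    count = (~np.isnan(x)).sum(axis=axis)
    return np.nansum(x, axis=axis) / count

def via_masked(x, axis=0):
    # Same quantity via a masked array; masked results become NaN.
    return np.ma.masked_invalid(x).mean(axis=axis).filled(np.nan)

# Both approaches should agree before their timings are compared.
assert np.allclose(via_nansum(x), via_masked(x))

t_nan = timeit.timeit(lambda: via_nansum(x), number=20)
t_ma = timeit.timeit(lambda: via_masked(x), number=20)
print("nansum-based: %.4f s, masked-array: %.4f s" % (t_nan, t_ma))
```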

Skipper