[SciPy-dev] PEP: Improving the basic statistical functions in Scipy

Fri Feb 27 15:40:59 EST 2009

On Feb 27, 2009, at 3:14 PM, josef.pktd at gmail.com wrote:
> One example:
> In the fit method of the distributions with bounded support, if there
> are observations outside of the bound than the negative log-likelihood
> is set to inf:
>
>        cond0 = (x <= self.a) | (x >= self.b)
>        if (any(cond0)):
>            return inf
>        else:
>            N = len(x)
>            return self._nnlf(x, *args) + N*log(scale)
>
> In this case, it might still produce the correct result since the
> check is before the aggregation. However, this is implementation
> specific. If I had assigned the inf before the summation of the
> log-likelihood contributions, ma.log would have removed them, and
> killed the boundary check.

OK, so you don't want to use the ma functions there. Pb is that you  
won't be able to use the np versions on MA either
 >>>  x = ma.array([0,1,2],mask=[0,1,0])
 >>> np.log(x)
masked_array(data = [-- -- 0.69314718056],
              mask = [ True  True False],
        fill_value = 1e+20)
np.log(x) first work on the data, then call MA.__array_wrap__. This  
function checks the initial mask, then the context of the function: as  
it's a domained function, the entries outside the domain are  
transformed into mask.

For this kind of problem, the easiest is to decouple:
1. Take a view of the input as a standard ndarray.
2. Process the view
3. Add the mask of the input if needed.

With the previous example, that'd be roughly
 >>> ma.array(np.log(x.view(ndarray)), mask=ma.getmask(x))
masked_array(data = [-inf -- 0.69314718056],
              mask = [False  True False],
        fill_value = 1e+20)

You keep the masked entry at index 1, but don't mask the entry at  
index 0.

> So when working with masked array functions, it is necessary to always
> keep in mind that the math is defined differently, which promises many
> happy hours of bug hunting.

Indeed. But once again, the masked versions of the function are more  
for convenience. If you need performance, you have to preprocess the  
inputs by transforming them into standard ndarrays one way or another.
In the case of correlation functions, for example, you can suppress  
missing values pair-wise (that is, drop the entries of x if the  
corresponding entries of y are masked, and vice-versa).
For basic linear fit, that might be an approach. A second would be to  
work by intervals, the limits of the intervals being a masked value.
For more complex fitting (eg, loess), problems arise. You can bypass  
them temporarily by raisong a NotImplementedError if the inputs are  
masked, it'd be up to the user to find a way to fill the inputs.