[SciPy-dev] PEP: Improving the basic statistical functions in Scipy
Pierre GM
pgmdevlist at gmail.com
Fri Feb 27 15:40:59 EST 2009
On Feb 27, 2009, at 3:14 PM, josef.pktd at gmail.com wrote:
> One example:
> In the fit method of the distributions with bounded support, if there
> are observations outside of the bound than the negative log-likelihood
> is set to inf:
>
> cond0 = (x <= self.a) | (x >= self.b)
> if (any(cond0)):
> return inf
> else:
> N = len(x)
> return self._nnlf(x, *args) + N*log(scale)
>
> In this case, it might still produce the correct result since the
> check is before the aggregation. However, this is implementation
> specific. If I had assigned the inf before the summation of the
> log-likelihood contributions, ma.log would have removed them, and
> killed the boundary check.
OK, so you don't want to use the ma functions there. Pb is that you
won't be able to use the np versions on MA either
>>> x = ma.array([0,1,2],mask=[0,1,0])
>>> np.log(x)
masked_array(data = [-- -- 0.69314718056],
mask = [ True True False],
fill_value = 1e+20)
np.log(x) first work on the data, then call MA.__array_wrap__. This
function checks the initial mask, then the context of the function: as
it's a domained function, the entries outside the domain are
transformed into mask.
For this kind of problem, the easiest is to decouple:
1. Take a view of the input as a standard ndarray.
2. Process the view
3. Add the mask of the input if needed.
With the previous example, that'd be roughly
>>> ma.array(np.log(x.view(ndarray)), mask=ma.getmask(x))
masked_array(data = [-inf -- 0.69314718056],
mask = [False True False],
fill_value = 1e+20)
You keep the masked entry at index 1, but don't mask the entry at
index 0.
> So when working with masked array functions, it is necessary to always
> keep in mind that the math is defined differently, which promises many
> happy hours of bug hunting.
Indeed. But once again, the masked versions of the function are more
for convenience. If you need performance, you have to preprocess the
inputs by transforming them into standard ndarrays one way or another.
In the case of correlation functions, for example, you can suppress
missing values pair-wise (that is, drop the entries of x if the
corresponding entries of y are masked, and vice-versa).
For basic linear fit, that might be an approach. A second would be to
work by intervals, the limits of the intervals being a masked value.
For more complex fitting (eg, loess), problems arise. You can bypass
them temporarily by raisong a NotImplementedError if the inputs are
masked, it'd be up to the user to find a way to fill the inputs.
More information about the SciPy-Dev
mailing list