[SciPy-user] Ols for np.arrays and masked arrays

Fri Jan 16 17:13:21 EST 2009

Josef,
I'm rewriting your module, expect some update in the next few hours  
(days...).
Still, I have some generic comments:

* If you need a function that supports both ndarrays and masked  
arrays, force the inputs to be masked arrays, that'll be easier.

* Use ma.fix_invalid to transform a ndarray w/ or w/o NaNs into a  
masked array: the NaNs will automatically be masked, and the  
underlying data fixed (a copy is made, no worry).

* if you need to mask an element, just mask it directly: you don't  
have to set it to NaN and then use np.isnan for the mask. So, instead  
of:
x_0 = x[:,0].copy()
x_0[0] = np.nan
x_0 = ma.masked_array(x_0, np.isnan(x_0))

just do:
x_0 = ma.array(x[:,0])
x_0[0] = ma.masked

* When mask is a boolean ndarray, just use x[~mask] instead of  
x[mask==False].

* To get rid of the missing data in x, use x.compressed() or emulate  
it with x.data[~ma.getmaskarray(x)]. ma.getmaskarray(x) always returns  
a ndarray with the same length as x, whereas ma.getmask(x) can return  
nomask.

* when manipulating masked arrays, if performance is an issue,  
decompose the process in manipulating the data and the mask  
separately. The easiest is to use .filled to get a pure ndarray for  
the data. The choice of the fill_value depends on the application. In  
ma.polyfit, we fill y with 0, which doesn't really matter as the  
corresponding coefficients of x will be 0 (through vander).

> Also this is the first time, that
> I use masked arrays, and I'm not sure I found the best way.

Don't worry, practice makes perfect.

>
> I treat masked arrays or nans by removing all observations with masked
> or nan values for the internal calculation, but keep the mask around,
> for the case when it is needed for some output.

You keep the *common* mask, which sounds OK. Removing the missing  
observations seems the way to go

> I looked at
> np.ma.polyfit, which uses a dummy fill value (0) before calling least
> squares, but in general it will be difficult to find "neutral" fill
> values..

cf explanation above.
>