[Numpy-discussion] Re: ndarray.fill and ma.array.filled

Fri Apr 7 15:54:01 EDT 2006

Folks,
I'm more or less in Eric's field (hydrology), and we do have to deal with
missing values that we can't interpolate straightforwardly (that is, without
some dark statistical magic). Simply discarding the data is not an option
either. MA fills most of that need.

I think one of the issues is what is meant by 'masked data':
- a missing observation ? 
- a NAN ?
- data we don't want to consider at one particular point ?
For the last point, think about raster maps or bitmaps: calculations should be
performed on a chunk of data, the initial data left untouched, and the result
should have the same size as the original while being valid only on the initial
chunk. The current MA implementation, with its _data part and its _mask part,
works nicely for the 3rd point.
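
A minimal sketch of that third case, assuming the present-day numpy.ma module rather than the MA package discussed in this thread. The raster values and the choice of chunk are made up for illustration.

```python
import numpy as np

# Hypothetical 4x4 raster; only the central 2x2 chunk should be processed.
raster = np.arange(16, dtype=float).reshape(4, 4)
outside = np.ones(raster.shape, dtype=bool)
outside[1:3, 1:3] = False            # False = valid, True = masked

chunk = np.ma.array(raster, mask=outside)
result = chunk * 2                   # new masked array, raster is not modified

print(result.shape)                  # same shape as the original raster
print(result.sum())                  # only the unmasked chunk contributes
```

The result keeps the full raster shape, and its mask marks everything outside the chunk as not valid.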

- I wonder whether implementing a 'filled' method for ndarrays is really 
better than letting the user create a MaskedArray, where the NANs are 
masked. In any case, a 'filled' method should always return a copy, as it's no 
longer the initial data.
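
As a rough illustration of the "mask the NANs, then fill" route, here is a sketch that assumes the present-day numpy.ma module (not the 2006 MA package); the fill value -999.0 is arbitrary.

```python
import numpy as np

raw = np.array([1.0, np.nan, 3.0])
masked = np.ma.masked_invalid(raw)   # NaNs (and infs) become masked entries
filled = masked.filled(-999.0)       # masked entries replaced in a new array

print(filled)
print(raw)                           # the original still holds the NaN
```

One caveat on the "always return a copy" point: in current numpy.ma, filled() is documented to hand back the underlying data rather than a copy when nothing is masked, so a caller that needs a guaranteed copy may still want an explicit .copy().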

- I'm not sure what to do with the idea of making ndarray a subclass of MA.
On one side, Tim rightly pointed out that a ndarray is just a MA with a 'False'
mask. Actually, I'm a bit frustrated with the standard 'asarray' that shows
up in many functions. I'd prefer something like "if the argument is a
non-numpy sequence (tuples, lists), transform it into a ndarray, but if it's
already a ndarray or a MA, leave it as it is. Don't touch the mask if
present". That's how MA.asarray works, but unfortunately the std "asarray"
gets rid of the mask (and you end up with something which is not what you'd
expect). A 'mask=False' attribute in ndarray would be nice.
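
The difference can be sketched as follows, assuming the present-day numpy and numpy.ma modules. The helper name keep_array is hypothetical and only illustrates the "leave arrays alone" behaviour described above.

```python
import numpy as np

def keep_array(x):
    """Hypothetical helper: convert plain sequences, leave arrays alone."""
    if isinstance(x, np.ndarray):    # MaskedArray is an ndarray subclass
        return x
    return np.asarray(x)

m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

plain = np.asarray(m)                # base ndarray, the mask is gone
kept = keep_array(m)                 # same masked array, mask intact
converted = keep_array([1, 2, 3])    # list becomes a plain ndarray
```

In current numpy, np.asanyarray gives much the same pass-through behaviour for ndarray subclasses.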

On the other, some methods/functions make sense only on unmasked ndarrays (FFT,
solving equations), and some others are a bit tricky to implement (diff ?
median...). An exception could be raised if the arguments of these
functions return True with ismasked (cf below), or that could be simplified
if 'mask' was a default attribute of numarrays.
I regularly have to use an ismasked function like this one:
def ismasked(a):
    if hasattr(a,'mask'):
        return a.mask.any()
    else:
        return False
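
The "raise an exception" idea could look something like the sketch below. It assumes the present-day numpy.ma module, whose np.ma.is_masked plays the role of the ismasked function above; the wrapper name fft_unmasked is hypothetical.

```python
import numpy as np

def fft_unmasked(a):
    """Hypothetical wrapper: refuse input that has masked values."""
    if np.ma.is_masked(a):
        raise ValueError("fft is undefined on masked values; fill them first")
    return np.fft.fft(np.asarray(a))

ok = fft_unmasked(np.array([1.0, 0.0, 0.0, 0.0]))

try:
    fft_unmasked(np.ma.array([1.0, 2.0], mask=[False, True]))
    raised = False
except ValueError:
    raised = True
```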

We're going towards MA as the default object.

But then again, what would be the behavior to deal with missing values ? Using  
R-like na.actions ? That'd be great, but it's getting more complex. 

Oh, and another thing: if 'mask', or 'masked' becomes a default attribute of 
ndarrays, how do we define a mask? As a boolean ndarray whose 'mask' is 
always 'False' ? How do you __repr__ it ?

- I agree that 'filled_value' is not very useful. If I want to fill an array, 
I'm happy to specify what value I want it filled with. In fact, I'd be 
happier to specify 'values'. I often have to work with 2D arrays, each 
column representing a different variable. If this array has to be filled, I'd 
like each column to be filled with one particular value, not necessarily the 
same along all columns: something like

column_stack([A[:,k].filled(filler[k]) for k in range(A.shape[1])]) 

with filler a 1xA.shape[1] array of filling values. Of course, we could 
imagine the same thing for rows, or higher dimensions...
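
A small sketch of the per-column fill, assuming the present-day numpy.ma module; the array A and the values in filler are made up. The second form gets the same result through broadcasting instead of a loop over columns.

```python
import numpy as np

A = np.ma.array([[1.0, 2.0], [3.0, 4.0]],
                mask=[[True, False], [False, True]])
filler = np.array([-1.0, -2.0])      # one fill value per column

# Column-by-column version, as in the expression above.
by_column = np.column_stack(
    [A[:, k].filled(filler[k]) for k in range(A.shape[1])]
)

# Broadcasting version: filler (shape (2,)) is stretched along the rows.
by_broadcast = np.where(np.ma.getmaskarray(A), filler, A.data)
```

For a per-row fill, the same broadcasting trick should work with filler reshaped to a column, e.g. filler[:, None].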

Sorry for the rants...