[Numpy-discussion] Re: ndarray.fill and ma.array.filled

Fri Apr 7 14:16:06 EDT 2006

Eric Firing wrote:

> Sasha wrote:
>
>>
>>
>> On 4/7/06, Tim Hochberg <tim.hochberg at cox.net> wrote:
>>
>>     ...
>>     In general, I'm skeptical of adding more methods to the ndarray
>>     object -- there are plenty already.
>>
>>
>> I've also proposed to drop "fill" in favor of optimizing x[...] = 
>> <scalar>.  Having both "fill" and "filled" in the interface is plain 
>> awkward.  You may like the combined proposal better because it does 
>> not change the total number of methods :-)
>>  
>>
>>     In addition, it appears that both the method and function
>>     versions of filled are "dangerous" in the sense that they
>>     sometimes return the array itself and sometimes a copy.
>>
>>
>> This is true in ma, but may certainly be changed.
>>  
>>
>>     Finally, changing ndarray to support masked array feels a bit
>>     like the tail wagging the dog.
>>
>> I disagree. Numpy is pretty much alone among the array languages 
>> because it does not have "native" support for missing values. For 
>> the floating point types some rudimentary support for nans exists, 
>> but is not really usable.  There is no missing values mechanism for 
>> integer types.  I believe adding "filled" and maybe "mask" to ndarray 
>> (not necessarily under these names) could be a meaningful step 
>> towards "native" support for missing values.  
>
>
> I agree strongly with you, Sasha.  I get the impression that the world 
> of numerical computation is divided into those who work with idealized 
> "data", where nothing is missing, and those who work with real 
> observations, where there is always something missing.

I think your experience is clouding your judgement here. Or at least 
this comes off as unnecessarily pejorative. There's a large class of 
people who work with data that doesn't have missing values either 
because of the nature of data acquisition or because they're doing 
simulations. I take zillions of measurements with digital oscilloscopes 
and they *never* have missing values. Clipped values, yes, but even if I 
somehow could query the scope about which values were actually clipped 
or simply make an educated guess based on their value, the facilities of 
ma would be useless to me. The clipped values are what I would want in 
any case.  I also do a lot of work with simulations derived from this 
and other data. I don't come across missing values here but again, if I 
did, the way ma works would not help me. I'd have to treat them either 
by rejecting the data outright or by some sort of interpolation.
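
To make those two options concrete, here is a minimal sketch in 
present-day NumPy. It assumes the bad samples have already been flagged 
in a boolean array and that at least one good sample exists. The names 
trace, bad, reject_if_bad and interpolate_bad are made up for 
illustration; np.flatnonzero and np.interp are the only library calls.

```python
import numpy as np

def reject_if_bad(trace, bad):
    """Return the trace unchanged, or None if any sample is flagged bad."""
    return None if bad.any() else trace

def interpolate_bad(trace, bad):
    """Return a float copy with flagged samples linearly interpolated.

    Assumes at least one sample is not flagged.
    """
    good_idx = np.flatnonzero(~bad)   # positions of usable samples
    bad_idx = np.flatnonzero(bad)     # positions to be filled in
    repaired = trace.astype(float)    # astype copies, so trace is untouched
    repaired[bad_idx] = np.interp(bad_idx, good_idx, trace[good_idx])
    return repaired

# Hypothetical five-sample trace with one flagged sample at index 2.
trace = np.array([0.0, 1.0, 99.0, 3.0, 4.0])
bad = np.array([False, False, True, False, False])

rejected = reject_if_bad(trace, bad)     # None: the whole trace is dropped
repaired = interpolate_bad(trace, bad)   # index 2 is replaced by 2.0
```

Neither route needs a mask to travel with the array, which is the point 
being made above.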

> As an oceanographer, I am solidly in the latter category.  If good 
> support for missing values is not built in, it has to be bolted on, 
> and it becomes clunky and awkward.  

This may be a false dichotomy. It's certainly not obvious to me that 
this is so. At least if "bolted on" means "not adding a filled method to 
ndarray".

> I was reluctant to speak up about this earlier because I thought it 
> was too much to ask of Travis when he was in the midst of putting 
> numpy on solid ground.  But I am delighted that missing value support 
> has a champion among numpy developers, and I agree that now is the 
> time to change it from "bolted on" to "integrated".

I have no objection to ma support improving. In fact I think it would be 
great although I don't foresee it helping me anytime soon. I also support 
Sasha's goal of being able to mix MaskedArrays and ndarrays reasonably 
seamlessly.

However, I do think the situation needs more thought. Slapping filled 
and mask onto ndarray is the path of least resistance, but it's not 
clear that it's the best one.
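
For readers who want to see the "sometimes the array itself, sometimes a 
copy" behaviour of filled quoted earlier, here is a small sketch against 
a current numpy.ma. It reflects how recent releases document filled, not 
necessarily the 2006 code under discussion; the variable names are 
illustrative only.

```python
import numpy as np
import numpy.ma as ma

unmasked = ma.array([1, 2, 3])                           # nothing masked
masked = ma.array([1, 2, 3], mask=[False, True, False])  # middle entry masked
plain = np.array([1, 2, 3])                              # ordinary ndarray

a = unmasked.filled(0)   # nothing to replace
b = masked.filled(0)     # masked entry replaced by the fill value 0
c = ma.filled(plain, 0)  # function form applied to a plain ndarray

# Does the result still point at the original buffer?
a_shares = np.shares_memory(a, unmasked.data)  # shares when nothing is masked
b_shares = np.shares_memory(b, masked.data)    # fresh copy when something is
c_is_plain = c is plain                        # ndarray input handed back as-is
```

Writing into a or c therefore changes the original data, while writing 
into b does not, which is why "filled returning a copy" is spelled out 
as a requirement in this thread.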

If we do decide we are going to add both of these methods to ndarray 
(with filled returning a copy!), then it may be worth considering making 
ndarray a subclass of MaskedArray. Conceptually this makes sense, since 
at this point an ndarray will just be a MaskedArray where mask is always 
False. I think that they could share much of the implementation except 
that ndarray would be set up to use methods that ignored the mask 
attribute since they would know that it's always false. Even that might 
not be worth it, since the check for whether mask is True/False is just 
a pointer compare.
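
The cheap "is there a mask at all" check can be sketched with today's 
numpy.ma, where the absence of a mask is represented by the ma.nomask 
sentinel and ma.getmask also accepts plain ndarrays. Note that current 
NumPy makes MaskedArray a subclass of ndarray, the reverse of the 
arrangement floated above. The helper name has_mask is hypothetical.

```python
import numpy as np
import numpy.ma as ma

def has_mask(arr):
    """Identity comparison against the nomask sentinel; no element scan."""
    return ma.getmask(arr) is not ma.nomask

plain = np.array([1.0, 2.0, 3.0])     # ordinary ndarray, no mask attribute
unmasked = ma.array([1.0, 2.0, 3.0])  # MaskedArray created without a mask
partly = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

flags = [has_mask(plain), has_mask(unmasked), has_mask(partly)]
```

Under this check a plain ndarray and a MaskedArray built without a mask 
are indistinguishable, which is the sense in which an ndarray behaves 
like a MaskedArray whose mask is always False.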

It may in fact be best just to do away with MaskedArray entirely, moving 
the functionality into ndarray. That may have performance implications, 
although I don't see them at the moment. I also don't know whether other 
methods/attributes would need to be moved over; it looks like just mask, 
filled and possibly filled_value, though the latter looks a little 
dubious to me.

Either of the above two options would certainly improve the quality of 
MaskedArray. Copy for instance seems not to have been implemented, and 
who knows what other dark corners remain unexplored here.

There's a whole spectrum of possibilities here from ones that don't 
intrude on ndarray at all to ones that profoundly change it. Sasha's 
suggestion looks like it's probably the simplest thing in the short 
term, but I don't know that it's the best long term solution. I think it 
needs more thought and discussion, which is after all what Sasha asked 
for ;)

Regards,

-tim