[Numpy-discussion] Re: ndarray.fill and ma.array.filled

Fri Apr 7 15:37:03 EDT 2006

Tim Hochberg wrote:
> Eric Firing wrote:
> 
>> Sasha wrote:
>>
>>>
>>>
>>> On 4/7/06, *Tim Hochberg* <tim.hochberg at cox.net> wrote:
>>>
>>>     ...
>>>     In general, I'm skeptical of adding more methods to the ndarray
>>>     object -- there are plenty already.
>>>
>>>
>>> I've also proposed to drop "fill" in favor of optimizing x[...] = 
>>> <scalar>.  Having both "fill" and "filled" in the interface is plain 
>>> awkward.  You may like the combined proposal better because it does 
>>> not change the total number of methods :-)
>>>  
>>>
>>>     In addition, it appears that both the method and function
>>>     versions of filled are "dangerous" in the sense that they
>>>     sometimes return the array itself and sometimes a copy.
>>>
>>>
>>> This is true in ma, but may certainly be changed.
>>>  
>>>
>>>     Finally, changing ndarray to support masked array feels a bit
>>>     like the tail wagging the dog.
>>>
>>> I disagree. Numpy is pretty much alone among the array languages 
>>> because it does not have "native" support for missing values. For 
>>> the floating point types some rudimentary support for nans exists, 
>>> but is not really usable.  There is no missing-value mechanism for 
>>> integer types.  I believe adding "filled" and maybe "mask" to ndarray 
>>> (not necessarily under these names) could be a meaningful step 
>>> towards "native" support for missing values.  
>>
>>
>>
>> I agree strongly with you, Sasha.  I get the impression that the world 
>> of numerical computation is divided into those who work with idealized 
>> "data", where nothing is missing, and those who work with real 
>> observations, where there is always something missing.
> 
> 
> I think your experience is clouding your judgement here. Or at least 
> this comes off as unnecessarily pejorative. There's a large class of 
> people who work with data that doesn't have missing values either 
> because of the nature of data acquisition or because they're doing 
> simulations. I take zillions of measurements with digital oscilloscopes 
> and they *never* have missing values. Clipped values, yes, but even if I 
> somehow could query the scope about which values were actually clipped 
> or simply make an educated guess based on their value, the facilities of 
> ma would be useless to me. The clipped values are what I would want in 
> any case.  I also do a lot of work with simulations derived from this 
> and other data. I don't come across missing values here but again, if I 
> did, the way ma works would not help me. I'd have to treat them either 
> by rejecting the data outright or by some sort of interpolation.

Tim,

The point is well-taken, and I apologize.  I stated my case badly.  (I 
would be delighted if I did not have to be concerned with missing 
values; they are a pain regardless of how well a numerical package 
handles them.)

> 
>> As an oceanographer, I am solidly in the latter category.  If good 
>> support for missing values is not built in, it has to be bolted on, 
>> and it becomes clunky and awkward.  
> 
> 
> This may be a false dichotomy. It's certainly not obvious to me that 
> this is so. At least if "bolted on" means "not adding a filled method to 
> ndarray".

I probably overstated it, but I think we actually agree.  I intended to 
lend support to the priority of making missing-value support as seamless 
and painless as possible.  It will help some people, and not others.

> 
>> I was reluctant to speak up about this earlier because I thought it 
>> was too much to ask of Travis when he was in the midst of putting 
>> numpy on solid ground.  But I am delighted that missing value support 
>> has a champion among numpy developers, and I agree that now is the 
>> time to change it from "bolted on" to "integrated".
> 
> 
> 
> I have no objection to ma support improving. In fact I think it would be 
> great although I don't foresee it helping me anytime soon. I also support 
> Sasha's goal of being able to mix MaskedArrays and ndarrays reasonably 
> seamlessly.
> 
> However, I do think the situation needs more thought. Slapping filled 
> and mask onto ndarray is the path of least resistance, but it's not 
> clear that it's the best one.
> 
> If we do decide we are going to add both of these methods to ndarray 
> (with filled returning a copy!), then it may be worth considering making 
> ndarray a subclass of MaskedArray. Conceptually this makes sense, since 
> at this point an ndarray will just be a MaskedArray where mask is always 
> False. I think that they could share much of the implementation except 
> that ndarray would be set up to use methods that ignored the mask 
> attribute since they would know that it's always false. Even that might 
> not be worth it, since the check for whether mask is True/False is just 
> a pointer compare.
> 
> It may in fact be best just to do away with MaskedArray entirely, moving 
> the functionality into ndarray. That may have performance implications, 
> although I don't see them at the moment. I also don't know if there are 
> other methods/attributes that this would imply need to be moved over; 
> it looks like just mask, filled and possibly fill_value, although the 
> latter looks a little dubious to me.
> 

This is exactly the option that I was afraid to bring up because I 
thought it might be too disruptive, and because I am not contributing to 
numpy, and probably don't have the competence (or time) to do so.

> Either of the above two options would certainly improve the quality of 
> MaskedArray. Copy for instance seems not to have been implemented, and 
> who knows what other dark corners remain unexplored here.
> 
> There's a whole spectrum of possibilities here from ones that don't 
> intrude on ndarray at all to ones that profoundly change it. Sasha's 
> suggestion looks like it's probably the simplest thing in the short 
> term, but I don't know that it's the best long term solution. I think it 
> needs more thought and discussion, which is after all what Sasha asked 
> for ;)

Exactly!  Thank you for broadening the discussion.

Eric
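
For readers following the thread, the two behaviours under discussion
("fill" versus x[...] = <scalar>, and whether "filled" returns a copy) can
be sketched as below. This is only an illustration written against a
current NumPy with numpy.ma, which is an assumption: it is not the 2006
code being discussed, and the copy behaviour it reports may differ between
versions.

```python
import numpy as np
import numpy.ma as ma

# ndarray.fill(value) and x[...] = value both overwrite every element in place.
a = np.empty(4)
a.fill(7.0)
b = np.empty(4)
b[...] = 7.0
same_result = np.array_equal(a, b)

# filled() replaces masked entries with a fill value and returns a plain ndarray.
x = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
y = x.filled(-999.0)

# Because an entry was masked, y is a fresh copy: writing to it leaves x alone.
y[0] = 0.0

# The aliasing concern from the thread: when nothing is masked, filled() may
# hand back the underlying data rather than a copy. np.shares_memory reports
# which case applies in the installed NumPy version.
z = ma.array([1.0, 2.0, 3.0])
aliased = np.shares_memory(z.filled(0.0), z.data)

print(same_result, y, x[0], aliased)
```

If "aliased" comes out true, code that modifies the result of filled() in
place should take an explicit copy first.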