[Numpy-discussion] Re: ndarray.fill and ma.array.filled
Eric Firing
efiring at hawaii.edu
Fri Apr 7 15:37:03 EDT 2006
Tim Hochberg wrote:
> Eric Firing wrote:
>
>> Sasha wrote:
>>
>>>
>>>
>>> On 4/7/06, *Tim Hochberg* <tim.hochberg at cox.net
>>> <mailto:tim.hochberg at cox.net>> wrote:
>>>
>>> ...
>>> In general, I'm skeptical of adding more methods to the ndarray object
>>> -- there are plenty already.
>>>
>>>
>>> I've also proposed to drop "fill" in favor of optimizing x[...] =
>>> <scalar>. Having both "fill" and "filled" in the interface is plain
>>> awkward. You may like the combined proposal better because it does
>>> not change the total number of methods :-)
>>>
>>>
>>> In addition, it appears that both the method and function versions of
>>> filled are "dangerous" in the sense that they sometimes return the array
>>> itself and sometimes a copy.
>>>
>>>
>>> This is true in ma, but may certainly be changed.
>>>
>>>
>>> Finally, changing ndarray to support masked array feels a bit like the
>>> tail wagging the dog.
>>>
>>> I disagree. Numpy is pretty much alone among the array languages
>>> because it does not have "native" support for missing values. For
>>> the floating point types some rudimentary support for nans exists,
>>> but is not really usable. There is no missing-value mechanism for
>>> integer types. I believe adding "filled" and maybe "mask" to ndarray
>>> (not necessarily under these names) could be a meaningful step
>>> towards "native" support for missing values.
>>
>>
>>
>> I agree strongly with you, Sasha. I get the impression that the world
>> of numerical computation is divided into those who work with idealized
>> "data", where nothing is missing, and those who work with real
>> observations, where there is always something missing.
>
>
> I think your experience is clouding your judgement here. Or at least
> this comes off as unnecessarily pejorative. There's a large class of
> people who work with data that doesn't have missing values either
> because of the nature of data acquisition or because they're doing
> simulations. I take zillions of measurements with digital oscilloscopes
> and they *never* have missing values. Clipped values, yes, but even if I
> somehow could query the scope about which values were actually clipped
> or simply make an educated guess based on their value, the facilities of
> ma would be useless to me. The clipped values are what I would want in
> any case. I also do a lot of work with simulations derived from this
> and other data. I don't come across missing values here but again, if I
> did, the way ma works would not help me. I'd have to treat them either
> by rejecting the data outright or by some sort of interpolation.
Tim,
The point is well-taken, and I apologize. I stated my case badly. (I
would be delighted if I did not have to be concerned with missing
values -- they are a pain regardless of how well a numerical package
handles them.)
>
>> As an oceanographer, I am solidly in the latter category. If good
>> support for missing values is not built in, it has to be bolted on,
>> and it becomes clunky and awkward.
>
>
> This may be a false dichotomy. It's certainly not obvious to me that
> this is so. At least if "bolted on" means "not adding a filled method to
> ndarray".
I probably overstated it, but I think we actually agree. I intended to
lend support to the priority of making missing-value support as seamless
and painless as possible. It will help some people, and not others.
>
>> I was reluctant to speak up about this earlier because I thought it
>> was too much to ask of Travis when he was in the midst of putting
>> numpy on solid ground. But I am delighted that missing value support
>> has a champion among numpy developers, and I agree that now is the
>> time to change it from "bolted on" to "integrated".
>
>
>
> I have no objection to ma support improving. In fact I think it would be
> great although I don't foresee it helping me anytime soon. I also support
> Sasha's goal of being able to mix MaskedArrays and ndarrays reasonably
> seamlessly.
>
> However, I do think the situation needs more thought. Slapping filled
> and mask onto ndarray is the path of least resistance, but it's not
> clear that it's the best one.
>
> If we do decide we are going to add both of these methods to ndarray
> (with filled returning a copy!), then it may be worth considering making
> ndarray a subclass of MaskedArray. Conceptually this makes sense, since
> at this point an ndarray will just be a MaskedArray where mask is always
> False. I think that they could share much of the implementation except
> that ndarray would be set up to use methods that ignored the mask
> attribute since they would know that it's always false. Even that might
> not be worth it, since the check for whether mask is True/False is just
> a pointer compare.
>
> It may in fact be best just to do away with MaskedArray entirely, moving
> the functionality into ndarray. That may have performance implications,
> although I don't see them at the moment, and I don't know if there are
> other methods/attributes that this would imply need to be moved over,
> although it looks like just mask, filled and possibly filled_value,
> although the latter looks a little dubious to me.
>
This is exactly the option that I was afraid to bring up because I
thought it might be too disruptive, and because I am not contributing to
numpy, and probably don't have the competence (or time) to do so.
> Either of the above two options would certainly improve the quality of
> MaskedArray. Copy for instance seems not to have been implemented, and
> who knows what other dark corners remain unexplored here.
>
> There's a whole spectrum of possibilities here from ones that don't
> intrude on ndarray at all to ones that profoundly change it. Sasha's
> suggestion looks like it's probably the simplest thing in the short
> term, but I don't know that it's the best long term solution. I think it
> needs more thought and discussion, which is after all what Sasha asked
> for ;)
Exactly! Thank you for broadening the discussion.
Eric
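
For reference, the concern quoted above (that filled sometimes returns the
array itself and sometimes a copy) can be illustrated with a minimal sketch.
It assumes a recent NumPy release and its numpy.ma module, which may differ
from the ma code under discussion in this thread; the variable names are
made up for illustration.

```python
import numpy as np
import numpy.ma as ma

# Illustrative inputs (hypothetical names, not from the thread).
plain = np.array([1.0, 2.0, 3.0])
unmasked = ma.array([1.0, 2.0, 3.0])  # no mask supplied
masked = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# Function form on a plain ndarray: the very same object comes back.
same_object = ma.filled(plain, 0.0) is plain

# Method form with nothing masked: the result shares memory with the data.
aliases_data = np.shares_memory(unmasked.filled(0.0), unmasked.data)

# Method form with a masked element: a fresh copy with the fill value applied.
filled_copy = masked.filled(0.0)
is_copy = not np.shares_memory(filled_copy, masked.data)

print(same_object, aliases_data, is_copy)
```

Because the first two results alias the original data, writing into them
also changes the source array, whereas writing into the third does not. That
asymmetry is the "danger" being described.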