[Numpy-discussion] Re: ndarray.fill and ma.array.filled
Tim Hochberg
tim.hochberg at cox.net
Fri Apr 7 14:16:06 EDT 2006
Eric Firing wrote:
> Sasha wrote:
>
>>
>>
>> On 4/7/06, *Tim Hochberg* <tim.hochberg at cox.net> wrote:
>>
>> ...
>> In general, I'm skeptical of adding more methods to the ndarray
>> object -- there are plenty already.
>>
>>
>> I've also proposed to drop "fill" in favor of optimizing x[...] =
>> <scalar>. Having both "fill" and "filled" in the interface is plain
>> awkward. You may like the combined proposal better because it does
>> not change the total number of methods :-)
>>
>>
>> In addition, it appears that both the method and function versions
>> of filled are "dangerous" in the sense that they sometimes return
>> the array itself and sometimes a copy.
>>
>>
>> This is true in ma, but may certainly be changed.
>>
>>
>> Finally, changing ndarray to support masked array feels a bit like
>> the tail wagging the dog.
>>
>> I disagree. Numpy is pretty much alone among the array languages
>> because it does not have "native" support for missing values. For
>> the floating point types some rudimentary support for nans exists,
>> but is not really usable. There is no missing-value mechanism for
>> integer types. I believe adding "filled" and maybe "mask" to ndarray
>> (not necessarily under these names) could be a meaningful step
>> towards "native" support for missing values.
>
>
> I agree strongly with you, Sasha. I get the impression that the world
> of numerical computation is divided into those who work with idealized
> "data", where nothing is missing, and those who work with real
> observations, where there is always something missing.
I think your experience is clouding your judgement here. Or at least
this comes off as unnecessarily pejorative. There's a large class of
people who work with data that doesn't have missing values either
because of the nature of data acquisition or because they're doing
simulations. I take zillions of measurements with digital oscilloscopes
and they *never* have missing values. Clipped values, yes, but even if I
somehow could query the scope about which values were actually clipped
or simply make an educated guess based on their value, the facilities of
ma would be useless to me. The clipped values are what I would want in
any case. I also do a lot of work with simulations derived from this
and other data. I don't come across missing values here but again, if I
did, the way ma works would not help me. I'd have to treat them either
by rejecting the data outright or by some sort of interpolation.
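The "some sort of interpolation" route can be sketched as follows. This is only an illustration under assumptions: the helper name interpolate_missing and the sample trace are made up, and linear interpolation via np.interp is just one possible choice, not a description of any particular workflow.

```python
import numpy as np

def interpolate_missing(samples, missing):
    """Return a copy of samples with flagged entries linearly interpolated.

    Hypothetical helper: `missing` is a boolean array, True where a sample
    should be replaced using its valid neighbours.
    """
    samples = np.asarray(samples, dtype=float)
    missing = np.asarray(missing, dtype=bool)
    idx = np.arange(samples.size)
    result = samples.copy()
    # Interpolate at the flagged positions from the unflagged ones.
    result[missing] = np.interp(idx[missing], idx[~missing], samples[~missing])
    return result

# Made-up trace in which the third sample is flagged as missing.
trace = np.array([0.0, 1.0, 0.0, 3.0, 4.0])
bad = np.array([False, False, True, False, False])
repaired = interpolate_missing(trace, bad)
```

Note that nothing here needs a masked array: a plain boolean array carries the flags, and the result is an ordinary ndarray.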
> As an oceanographer, I am solidly in the latter category. If good
> support for missing values is not built in, it has to be bolted on,
> and it becomes clunky and awkward.
This may be a false dichotomy. It's certainly not obvious to me that
this is so. At least if "bolted on" means "not adding a filled method to
ndarray".
> I was reluctant to speak up about this earlier because I thought it
> was too much to ask of Travis when he was in the midst of putting
> numpy on solid ground. But I am delighted that missing value support
> has a champion among numpy developers, and I agree that now is the
> time to change it from "bolted on" to "integrated".
I have no objection to ma support improving. In fact I think it would be
great, although I don't foresee it helping me anytime soon. I also support
Sasha's goal of being able to mix MaskedArrays and ndarrays reasonably
seamlessly.
However, I do think the situation needs more thought. Slapping filled
and mask onto ndarray is the path of least resistance, but it's not
clear that it's the best one.
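The concern raised earlier in the thread, that filled sometimes returns the array itself and sometimes a copy, can be checked with a short sketch. It assumes a current NumPy with numpy.ma rather than the 2006 code under discussion, and the variable names are made up for illustration.

```python
import numpy as np
import numpy.ma as ma

plain = np.array([1.0, 2.0, 3.0])
no_mask = ma.array([1.0, 2.0, 3.0])
some_mask = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# Function version on a plain ndarray: the input comes straight back.
function_returns_input = ma.filled(plain, 0.0) is plain

# Method version with nothing masked: the result shares the original buffer.
aliases_when_unmasked = np.shares_memory(no_mask.filled(0.0), no_mask.data)

# Method version with a masked entry: the result is a fresh copy.
copies_when_masked = not np.shares_memory(some_mask.filled(0.0), some_mask.data)
```

If filled were added to ndarray, a caller writing into the result would need to know which of these cases applies, which is the argument for returning a copy unconditionally.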
If we do decide we are going to add both of these methods to ndarray
(with filled returning a copy!), then it may be worth considering making
ndarray a subclass of MaskedArray. Conceptually this makes sense, since
at this point an ndarray will just be a MaskedArray where mask is always
False. I think that they could share much of the implementation except
that ndarray would be set up to use methods that ignored the mask
attribute since they would know that it's always False. Even that might
not be worth it, since the check for whether mask is True/False is just
a pointer compare.
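The idea of treating an ndarray as a masked array whose mask is always False can be tried out with free functions, without touching either class. This is a minimal sketch assuming a current numpy.ma; has_no_mask, mask_of and filled_copy are hypothetical names, not existing or proposed NumPy API.

```python
import numpy as np
import numpy.ma as ma

def has_no_mask(a):
    """True when a carries no mask at all; an identity check, not an element scan."""
    return ma.getmask(a) is ma.nomask

def mask_of(a):
    """Boolean mask shaped like a; all False for a plain ndarray."""
    return ma.getmaskarray(a)

def filled_copy(a, fill_value):
    """Always return a new plain ndarray with masked entries replaced."""
    return np.array(ma.filled(a, fill_value), copy=True)

plain = np.array([1, 2, 3])
masked = ma.array([1, 2, 3], mask=[False, True, False])

plain_filled = filled_copy(plain, -1)    # independent copy of plain
masked_filled = filled_copy(masked, -1)  # masked entry replaced by -1
```

Both kinds of array go through the same calls here, and the cheap identity check in has_no_mask is the sort of test that would let mask-aware code skip the mask handling for plain arrays.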
It may in fact be best just to do away with MaskedArray entirely, moving
the functionality into ndarray. That may have performance implications,
although I don't see them at the moment. I also don't know whether other
methods/attributes would need to be moved over; it looks like just mask,
filled and possibly filled_value, though the latter looks a little
dubious to me.
Either of the above two options would certainly improve the quality of
MaskedArray. Copy for instance seems not to have been implemented, and
who knows what other dark corners remain unexplored here.
There's a whole spectrum of possibilities here from ones that don't
intrude on ndarray at all to ones that profoundly change it. Sasha's
suggestion looks like it's probably the simplest thing in the short
term, but I don't know that it's the best long term solution. I think it
needs more thought and discussion, which is after all what Sasha asked
for ;)
Regards,
-tim