[Numpy-discussion] Concepts for masked/missing data

Sat Jun 25 14:50:32 EDT 2011

On 06/25/2011 07:05 AM, Nathaniel Smith wrote:
> On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett <matthew.brett at gmail.com> wrote:
>> So far I see the difference between 1) and 2) being that you cannot
>> unmask.  So, if you didn't even know you could unmask data, then it
>> would not matter that 1) was being implemented by masks?
>
> I guess that is a difference, but I'm trying to get at something more
> fundamental -- not just what operations are allowed, but what
> operations people *expect* to be allowed. It seems like some of us
> have been talking past each other a lot, where someone says "but
> changing masks is the single most important feature!" and then someone
> else says "what are you talking about that doesn't even make sense".
>
>> To clarify, you're proposing for:
>>
>> a = np.sum(np.array([np.NA, np.NA]))
>>
>> 1) ->  np.NA
>> 2) ->  0.0
>
> Yes -- and in R you actually do get NA, while in numpy.ma you
> actually do get 0. I don't think this is a coincidence; I think it's

No, you don't:

In [2]: np.ma.array([2, 4], mask=[True, True]).sum()
Out[2]: masked

In [4]: np.sum(np.ma.array([2, 4], mask=[True, True]))
Out[4]: masked
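
For anyone who wants to reproduce this outside an interactive session, here is a minimal sketch. The variable names are made up for illustration; the calls are plain numpy.ma. It contrasts the all-masked case with a partially masked one, and shows that a plain 0 only appears if you ask for it by filling the masked entries first.

```python
import numpy as np

# Hypothetical example arrays: same data, different masks.
all_masked = np.ma.array([2, 4], mask=[True, True])
part_masked = np.ma.array([2, 4], mask=[True, False])

# Every element masked: the reduction returns the masked singleton.
total_all = all_masked.sum()
print(total_all is np.ma.masked)

# One element masked: the masked entry is skipped, only the 4 is summed.
total_part = part_masked.sum()
print(total_part)

# Getting 0 requires filling the masked entries explicitly before summing.
total_filled = all_masked.filled(0).sum()
print(total_filled)
```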

Eric

> because they're designed as coherent systems that are trying to solve
> different problems. (Well, numpy.ma's "hardmask" idea seems inspired
> by the missing-data concept rather than the temporary-mask concept,
> but aside from that it seems pretty consistent in implementing option
> 2.)
>
> Here's another possible difference -- in (1), intuitively, missingness
> is a property of the data, so the logical place to put information
> about whether you can expect missing values is in the dtype, and to
> enable missing values you need to make a new array with a new dtype.
> (If we use a mask-based implementation, then
> np.asarray(nomissing_array, dtype=yesmissing_type) would still be able
> to skip making a copy of the data -- I'm talking ONLY about the
> interface here, not whether missing data has a different storage
> format from non-missing data.)
>
> In (2), the whole point is to use different masks with the same data,
> so I'd argue masking should be a property of the array object rather
> than the dtype, and the interface should logically allow masks to be
> created, modified, and destroyed in place.
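
As a rough illustration of what "created, modified, and destroyed in place" looks like with today's numpy.ma (only a sketch of the existing behaviour, not of any proposed interface; the variable names are made up):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
m = np.ma.array(data)            # starts out with no mask at all

# Create a mask in place by assigning the masked constant to one element.
m[1] = np.ma.masked
sum_created = m.sum()            # 1 + 3 + 4

# Modify the mask in place: hide a different element instead.
m.mask = [False, False, True, False]
sum_modified = m.sum()           # 1 + 2 + 4

# Destroy the mask: every element is visible again.
m.mask = np.ma.nomask
sum_cleared = m.sum()            # 1 + 2 + 3 + 4

# The underlying values were never touched by any of the mask changes.
print(sum_created, sum_modified, sum_cleared, data)
```

The point of the sketch is that the mask is a separate piece of state attached to the array object, so it can change freely while the data stay put.
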
>
> They're both internally consistent, but I think we might have to make
> a decision and stick to it.
>
>> I agree it's good to separate the API from the implementation. I
>> think the implementation is also important because I care about memory
>> and possibly speed.  But, that is a separate problem from the API...
>
> Yes, absolutely memory and speed are important. But a really fast
> solution to the wrong problem isn't so useful either :-).
>
> -- Nathaniel