[Numpy-discussion] missing data discussion round 2

Tue Jun 28 19:00:04 EDT 2011

Hi,

On Tue, Jun 28, 2011 at 11:40 PM, Jason Grout
<jason-sage at creativetrax.com> wrote:
> On 6/28/11 5:20 PM, Matthew Brett wrote:
>> Hi,
>>
>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith<njs at pobox.com>  wrote:
>> ...
>>> (You might think, what difference does it make if you *can* unmask an
>>> item? Us missing data folks could just ignore this feature. But:
>>> whatever we end up implementing is something that I will have to
>>> explain over and over to different people, most of them not
>>> particularly sophisticated programmers. And there's just no sensible
>>> way to explain this idea that if you store some particular value, then
>>> it replaces the old value, but if you store NA, then the old value is
>>> still there.
>>
>> Ouch - yes.  No question, that is difficult to explain.   Well, I
>> think the explanation might go like this:
>>
>> "Ah, yes, well, that's because in fact numpy records missing values by
>> using a 'mask'.   So when you say `a[3] = np.NA`, what you mean is,
>> `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
>>
>> Is that fair?
>
> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
> the idea that the entry is still there, but we're just ignoring it.  Of
> course, that goes against common convention, but it might be easier to
> explain.

I think Nathaniel's point is that np.IGNORE is a different idea than
np.NA, and that is why joining the implementations can lead to
conceptual confusion.  For example, for:

a = np.array([np.NA, 1])

you might expect the result of a.sum() to be np.NA.  That's what it is
in R.  However for:

b = np.array([np.IGNORE, 1])

you'd probably expect b.sum() to be 1.  That's what it is for
masked_array currently.
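
To make the contrast concrete with what numpy can do today, here is a
rough sketch.  np.NA and np.IGNORE don't exist, so I'm assuming NaN as a
stand-in for the propagating NA idea, and numpy.ma for the IGNORE idea:

```python
import numpy as np

# NA-style (assumption: NaN stands in for np.NA, which does not exist).
# The missing value propagates through the reduction.
a = np.array([np.nan, 1.0])
na_style_sum = a.sum()

# IGNORE-style: numpy.ma skips masked entries in reductions.
# The 0.0 is just a placeholder payload hidden behind the mask.
b = np.ma.masked_array([0.0, 1.0], mask=[True, False])
ignore_style_sum = b.sum()

print(na_style_sum)      # propagated: the result is itself missing (nan)
print(ignore_style_sum)  # the masked entry was skipped
```

The two sums answer different questions, which is why I think they are
different ideas rather than two spellings of the same one.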

The current proposal fuses these two ideas with one implementation.
Quoting from the NEP:

>>> a = np.array([1., 3., np.NA, 7.], masked=True)
>>> np.sum(a)
array(NA, dtype='<f8', masked=True)
>>> np.sum(a, skipna=True)
11.0

I agree with Nathaniel that there is no practical way of avoiding the
full 'NAs are in fact values where there's a False in the mask'
concept, and that this imposes a serious conceptual cost on the 'NA'
user.
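
The 'old value is still there' behaviour can already be seen with
numpy.ma, so here is a small sketch of what the 'NA' user would have to
understand.  This uses masked_array as an analogy only, not the NEP
implementation, and note that numpy.ma uses the opposite mask convention
(True means hidden):

```python
import numpy as np

a = np.ma.masked_array([1.0, 3.0, 5.0, 7.0])

# "Storing NA": the entry is hidden, but the payload is not overwritten.
a[2] = np.ma.masked
sum_while_masked = a.sum()   # the hidden entry is skipped
hidden_payload = a.data[2]   # the old value is still in the data buffer

# Clearing the mask bit brings the old value back.
a.mask[2] = False
sum_after_unmask = a.sum()

# Storing an ordinary value, by contrast, really replaces the old one.
a[2] = 9.0
sum_after_store = a.sum()
```

So assigning a number destroys what was there, while assigning 'missing'
does not, and that asymmetry is the part that is hard to explain.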

Best,

Matthew