[Numpy-discussion] missing data discussion round 2

Wed Jun 29 10:35:14 EDT 2011

On 06/29/2011 03:45 PM, Matthew Brett wrote:
> Hi,
>
> On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe<mwwiebe at gmail.com>  wrote:
>> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett<matthew.brett at gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith<njs at pobox.com>  wrote:
>>> ...
>>>> (You might think, what difference does it make if you *can* unmask an
>>>> item? Us missing data folks could just ignore this feature. But:
>>>> whatever we end up implementing is something that I will have to
>>>> explain over and over to different people, most of them not
>>>> particularly sophisticated programmers. And there's just no sensible
>>>> way to explain this idea that if you store some particular value, then
>>>> it replaces the old value, but if you store NA, then the old value is
>>>> still there.
>>>
>>> Ouch - yes.  No question, that is difficult to explain.   Well, I
>>> think the explanation might go like this:
>>>
>>> "Ah, yes, well, that's because in fact numpy records missing values by
>>> using a 'mask'.   So when you say `a[3] = np.NA`, what you mean is,
>>> `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
>>>
>>> Is that fair?
>>
>> My favorite way of explaining it would be to have a grid of numbers written
>> on paper, then have several cardboards with holes poked in them in different
>> configurations. Placing these cardboard masks in front of the grid would
>> show different sets of non-missing data, without affecting the values stored
>> on the paper behind them.
>
> Right - but here of course you are trying to explain the mask, and
> this is Nathaniel's point, that in order to explain NAs, you have to
> explain masks, and so, even at a basic level, the fusion of the two
> ideas is obvious, and already confusing.  I mean this:
>
> a[3] = np.NA
>
> "Oh, so you just set the a[3] value to have some missing value code?"
>
> "Ah - no - in fact what I did was set an associated mask in position
> a[3] so that you can't any longer see the previous value of a[3]"
>
> "Huh.  You mean I have a mask for every single value in order to be
> able to blank out a[3]?  It looks like an assignment.  I mean, it
> looks just like a[3] = 4.  But I guess it isn't?"
>
> "Er..."
>
> I think Nathaniel's point is a very good one - these are separate
> ideas, np.NA and np.IGNORE, and a joint implementation is bound to
> draw them together in the mind of the user.    Apart from anything
> else, the user has to know that, if they want a single NA value in an
> array, they have to add a mask of size array.shape, in bytes.  They have
> to know then, that NA is implemented by masking, and then the 'NA for
> free by adding masking' idea breaks down and starts to feel like a
> kludge.
>
> The counter argument is of course that, in time, the implementation of
> NA with masking will seem as obvious and intuitive, as, say,
> broadcasting, and that we are just reacting from lack of experience
> with the new API.

However, no matter how used we get to this, people coming from almost 
any other tool (in particular R) will keep thinking it is 
counter-intuitive. Why set up a major semantic incompatibility that 
people then have to overcome in order to start using NumPy?

I really don't see what's wrong with some more explicit API like 
a.mask[3] = True. "Explicit is better than implicit".
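
To make the explicit spelling concrete, here is a rough sketch using the 
existing numpy.ma module as a stand-in. This is not the proposed np.NA 
API; the array contents and the choice of numpy.ma are only my 
assumptions for illustration. It shows that hiding an element through 
the mask leaves the stored value in place, and that the mask costs one 
byte per element:

```python
import numpy as np
import numpy.ma as ma

# Illustration only: numpy.ma stands in for the mask-based design under
# discussion; the values here are made up.
data = np.array([10.0, 11.0, 12.0, 13.0, 14.0])

# Start from a full (all-False) boolean mask so that it can be indexed.
a = ma.masked_array(data, mask=np.zeros(data.shape, dtype=bool))

# Explicit form: this visibly edits the mask, not the data.
a.mask[3] = True

# The value behind the mask is still stored...
hidden_value = a.data[3]

# ...but reductions skip the masked element.
visible_sum = a.sum()

# The mask has one boolean (one byte) per array element.
mask_bytes = a.mask.nbytes

print(a)
print(hidden_value, visible_sum, mask_bytes)
```

Written this way, nobody is led to think that the old value was 
overwritten, which is exactly the confusion that a[3] = np.NA invites.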

Dag Sverre