[Numpy-discussion] missing data discussion round 2

Mark Wiebe mwwiebe at gmail.com
Tue Jun 28 19:42:39 EDT 2011


On Tue, Jun 28, 2011 at 6:00 PM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> On Tue, Jun 28, 2011 at 11:40 PM, Jason Grout
> <jason-sage at creativetrax.com> wrote:
> > On 6/28/11 5:20 PM, Matthew Brett wrote:
> >> Hi,
> >>
> >> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> ...
> >>> (You might think, what difference does it make if you *can* unmask an
> >>> item? Us missing data folks could just ignore this feature. But:
> >>> whatever we end up implementing is something that I will have to
> >>> explain over and over to different people, most of them not
> >>> particularly sophisticated programmers. And there's just no sensible
> >>> way to explain this idea that if you store some particular value, then
> >>> it replaces the old value, but if you store NA, then the old value is
> >>> still there.
> >>
> >> Ouch - yes.  No question, that is difficult to explain.   Well, I
> >> think the explanation might go like this:
> >>
> >> "Ah, yes, well, that's because in fact numpy records missing values by
> >> using a 'mask'.   So when you say `a[3] = np.NA`, what you mean is,
> >> `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
> >>
> >> Is that fair?
> >
> > Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
> > the idea that the entry is still there, but we're just ignoring it.  Of
> > course, that goes against common convention, but it might be easier to
> > explain.
>
> I think Nathaniel's point is that np.IGNORE is a different idea than
> np.NA, and that is why joining the implementations can lead to
> conceptual confusion.  For example, for:
>
> a = np.array([np.NA, 1])
>
> you might expect the result of a.sum() to be np.NA.  That's what it is
> in R.  However for:
>
> b = np.array([np.IGNORE, 1])
>
> you'd probably expect b.sum() to be 1.  That's what it is for
> masked_array currently.
>
> The current proposal fuses these two ideas with one implementation.
> Quoting from the NEP:
>
> >>> a = np.array([1., 3., np.NA, 7.], masked=True)
> >>> np.sum(a)
> array(NA, dtype='<f8', masked=True)
> >>> np.sum(a, skipna=True)
> 11.0
>
> I agree with Nathaniel, that there is no practical way of avoiding the
> full 'NAs are in fact values where there's a False in the mask'
> concept, and that does impose a serious conceptual cost on the 'NA'
> user.
>

I'm not sure where the conceptual cost is coming from. If you're using
missing values with the masked array implementation, all you see are missing
value semantics. To see the additional masking behavior you have to deal
with more than one view of the same data at the same time, something that is
in and of itself already advanced for the novice user.
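
A rough sketch of the two behaviours under discussion, using only what
exists in numpy today. np.NA and np.IGNORE are proposals in this thread and
are not real names, so NaN stands in for NA and numpy.ma stands in for the
IGNORE case; the variable names are made up for illustration.

```python
import numpy as np

# NA-like semantics: NaN stands in for the proposed np.NA and propagates.
a = np.array([np.nan, 1.0])
na_sum = a.sum()                      # nan

# IGNORE-like semantics: numpy.ma skips masked entries in reductions.
raw = np.array([5.0, 1.0])            # plain ndarray holding the data
m = np.ma.masked_array(raw, mask=[False, False])
m[0] = np.ma.masked                   # mark the first entry as masked
ignore_sum = m.sum()                  # 1.0

# Through the masked array alone, only the masking is visible. The old
# value shows up only via a second reference to the underlying data.
still_there = raw[0]                  # 5.0
```

Someone who only ever touches m never sees the 5.0; it takes the separate
reference raw to notice that the value was not overwritten.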

-Mark


>
> Best,
>
> Matthew