[Numpy-discussion] missing data discussion round 2

Wed Jun 29 13:22:59 EDT 2011

On Wed, Jun 29, 2011 at 8:20 AM, Lluís <xscript at gmx.net> wrote:

> Matthew Brett writes:
>
> >> Maybe instead of np.NA, we could say np.IGNORE, which sort of conveys
> >> the idea that the entry is still there, but we're just ignoring it.  Of
> >> course, that goes against common convention, but it might be easier to
> >> explain.
>
> > I think Nathaniel's point is that np.IGNORE is a different idea than
> > np.NA, and that is why joining the implementations can lead to
> > conceptual confusion.
>
> This is how I see it:
>
> >>> a = np.array([0, 1, 2], dtype=int)
> >>> a[0] = np.NA
> ValueError
> >>> e = np.array([np.NA, 1, 2], dtype=int)
> ValueError
> >>> b  = np.array([np.NA, 1, 2], dtype=np.maybe(int))
> >>> m  = np.array([np.NA, 1, 2], dtype=int, masked=True)
> >>> bm = np.array([np.NA, 1, 2], dtype=np.maybe(int), masked=True)
> >>> b[1] = np.NA
> >>> np.sum(b)
> np.NA
> >>> np.sum(b, skipna=True)
> 2
> >>> b.mask
> None
> >>> m[1] = np.NA
> >>> np.sum(m)
> 2
> >>> np.sum(m, skipna=True)
> 2
> >>> m.mask
> [False, False, True]
> >>> bm[1] = np.NA
> >>> np.sum(bm)
> 2
> >>> np.sum(bm, skipna=True)
> 2
> >>> bm.mask
> [False, False, True]
>
> So:
>
> * Mask takes precedence over bit pattern on element assignment. There's
>  still the question of how to assign a bit pattern NA when the mask is
>  active.
>
> * When using mask, elements are automagically skipped.
>
> * "m[1] = np.NA" is equivalent to "m.mask[1] = False"
>
> * When using bit pattern + mask, it might make sense to have the initial
>  values as bit-pattern NAs, instead of masked (i.e., "bm.mask == [True,
>  False, True]" and "np.sum(bm) == np.NA")
>

There seems to be a general assumption that masks and NA bit patterns each
imply their own particular semantics, which I think is simply false. NaN and
Inf are both implemented in hardware using the same idea as the NA bit
pattern, yet they do not follow NA missing-value semantics.
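
To make the NaN/Inf point concrete, here is a small sketch using plain Python
floats (which are IEEE 754 doubles). It only illustrates how the hardware
values behave; the variable names are just for this example, and nothing here
is part of the proposed NA API.

```python
import math

nan = float("nan")
inf = math.inf

# NaN propagates through arithmetic, which looks a bit like NA...
nan_sum = sum([1.0, nan, 2.0])

# ...but comparisons return a definite False instead of "unknown",
# which is not what missing-value semantics would give.
nan_equals_itself = (nan == nan)
nan_less_than_one = (nan < 1.0)

# Inf is an ordinary ordered value, not a missing one.
inf_is_larger = inf > 1e308
inf_sum = sum([1.0, inf])

print(math.isnan(nan_sum), nan_equals_itself, nan_less_than_one)
print(inf_is_larger, inf_sum)
```

So the special bit patterns are there in hardware, but the semantics attached
to them are a separate choice.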

As far as I can tell, the only required difference between them is that
assigning NA through a bit pattern must destroy the underlying data, while a
mask leaves it in place. Nothing else. Everything on top of that is a choice
of API and interface mechanisms. I want them to behave exactly the same except
for that necessary difference, so that it will be possible to use the *exact
same Python code* with either approach.

Say you're using NA dtypes, and suddenly you think, "what if I temporarily
treated these other elements as NA too?" Now you have to copy your whole array
to avoid destroying your data! The NA bit pattern didn't save you any memory
here... Or say you're using masks, and it turns out you didn't actually need
masking semantics. If the two interfaces differ, you now have to make lots of
code changes to switch to NA dtypes!
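
A rough sketch of the "destroys the data" difference, using things that exist
in NumPy today. The assumptions here: NaN stands in for the proposed NA bit
pattern, and numpy.ma stands in for the proposed mask (note that numpy.ma uses
True to mean "masked", the opposite of the convention in the quoted example
above). This is not the proposed API, just an illustration.

```python
import numpy as np
import numpy.ma as ma

data = np.array([3.0, 1.0, 2.0])

# Sentinel approach: NaN stands in for an NA bit pattern (assumption).
sentinel = data.copy()   # the copy is what keeps the original values around
sentinel[1] = np.nan     # the 1.0 stored in this array is overwritten
sentinel_sum = np.nansum(sentinel)

# Mask approach: the element is hidden, but the stored value survives.
masked = ma.masked_array(data, mask=[False, True, False])
masked_sum = masked.sum()
recovered = masked.data[1]

print(sentinel_sum, masked_sum, recovered)
```

Both sums skip the same element, but only the masked array can give the
hidden value back without a second copy of the data.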

-Mark

>
> Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth