[Numpy-discussion] NA masks in the next numpy release?

Benjamin Root ben.root at ou.edu
Fri Oct 28 14:16:52 EDT 2011


On Fri, Oct 28, 2011 at 12:39 PM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> On Thu, Oct 27, 2011 at 10:56 PM, Benjamin Root <ben.root at ou.edu> wrote:
> >
> >
> > On Thursday, October 27, 2011, Charles R Harris <
> charlesr.harris at gmail.com>
> > wrote:
> >>
> >>
> >> On Thu, Oct 27, 2011 at 7:16 PM, Travis Oliphant <
> oliphant at enthought.com>
> >> wrote:
> >>>
> >>> That is a pretty good explanation.   I find myself convinced by
> Matthew's
> >>> arguments.    I think that being able to separate ABSENT from IGNORED
> is a
> >>> good idea.   I also like being able to control SKIP and PROPAGATE (but
> I
> >>> think the current implementation allows this already).
> >>>
> >>> What is the counter-argument to this proposal?
> >>>
> >>
> >> What exactly do you find convincing? The current masks propagate by
> >> default:
> >>
> >> In [1]: a = ones(5, maskna=1)
> >>
> >> In [2]: a[2] = NA
> >>
> >> In [3]: a
> >> Out[3]: array([ 1.,  1.,  NA,  1.,  1.])
> >>
> >> In [4]: a + 1
> >> Out[4]: array([ 2.,  2.,  NA,  2.,  2.])
> >>
> >> In [5]: a[2] = 10
> >>
> >> In [5]: a
> >> Out[5]: array([  1.,   1.,  10.,   1.,   1.], maskna=True)
> >>
> >>
> >> I don't see an essential difference between the implementation using
> masks
> >> and one using bit patterns, the mask when attached to the original array
> >> just adds a bit pattern by extending all the types by one byte, an
> approach
> >> that easily extends to all existing and future types, which is why Mark
> went
> >> that way for the first implementation given the time available. The
> masks
> >> are hidden because folks wanted something that behaved more like R and
> also
> >> because of the desire to combine the missing, ignore, and later possibly
> bit
> >> patterns in a unified manner. Note that the pseudo assignment was also
> meant
> >> to look like R. Adding true bit patterns to numpy isn't trivial and I
> >> believe Mark was thinking of parametrized types for that.
> >>
> >> The main problems I see with masks are unified storage and possibly
> memory
> >> use. The rest is just behavior and desired API and that can be adjusted
> >> within the current implementation. There is nothing essentially masky
> about
> >> masks.
> >>
> >> Chuck
> >>
> >>
> >
> > I think Chuck sums it up quite nicely.  The implementation detail about
> > using mask versus bit patterns can still be discussed and addressed.
> > Personally, I just don't see how parameterized dtypes would be easier to
> use
> > than the pseudo assignment.
> >
> > The elegance of Mark's solution was to consider the treatment of missing
> > data in a unified manner.  This puts missing data in a more prominent
> spot
> > for extension builders, which should greatly improve support throughout
> the
> > ecosystem.
>
> Are extension builders then required to use the numpy C API to get
> their data?  Speaking as an extension builder, I would rather you gave
> me the mask and the bitpattern information and let me do that myself.
>
>
Forgive me, I wasn't clear.  What I am speaking of is more about a typical
human failing.  If a programmer for a module never encounters masked arrays,
then when they code up a function to operate on numpy data, it is quite
likely that they would never take them into consideration.  Notice the
prolific use of "np.asarray()" even within the numpy codebase, which
strips the mask from masked arrays.
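
To make the "np.asarray()" point concrete, here is a minimal sketch using
numpy.ma (the existing masked-array module, not the NA-mask API being
discussed in this thread).  The function names "total" and
"total_mask_aware" are made up for the illustration.

```python
import numpy as np

def total(data):
    # Hypothetical library function written without masked arrays in mind.
    arr = np.asarray(data)      # returns a plain ndarray; any mask is dropped
    return arr.sum()

def total_mask_aware(data):
    # np.asanyarray passes ndarray subclasses (such as MaskedArray) through.
    arr = np.asanyarray(data)
    return arr.sum()

m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

naive = total(m)                # 6.0: the masked 2.0 is silently counted
aware = total_mask_aware(m)     # 4.0: the masked 2.0 is skipped
```

Nothing warns the caller in the first case, which is why I call it a human
failing rather than a bug in any one module.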

However, if missing data support is made more integral to the core of
numpy, then it is far more likely that a programmer would take it into
consideration when designing their algorithm, or at least explicitly
document that their module does not support missing data.  Both NEPs do
this by making missing data front-and-center.  However, my belief is that
Mark's approach is easier to comprehend and is cleaner.  Cleaner features
are more likely to be used.



> > By letting there be a single missing data framework (instead of
> > two) all that users need to figure out is when they want nan-like
> behavior
> > (propagate) or to be more like masks (skip).  Numpy takes care of the
> rest.
>  I like using masked arrays precisely because I don't have to
> > use nansum in my library functions to guard against the possibility of
> > receiving nans.  Duck-typing is a good thing.
> >
> > My argument against separating IGNORE and PROPAGATE is that it becomes
> too
> > tempting to want to mix these in an array, but the desired behavior would
> > likely become ambiguous.
> >
> > There is one other problem that I just thought of that I don't think has
> > been outlined in either NEP.  What if I perform an operation between an
> > array set up with propagate NAs and an array with skip NAs?
>
> These are explicitly covered in the alterNEP:
>
> https://gist.github.com/1056379/
>
>
Sort of.  You speak of reduction operations for a single array with a mix of
NAs and IGNOREs.  I guess in that case, it wouldn't make a difference for
element-wise operations between two arrays (given the added rule that NAs
propagate harder).  Although, what if skipna=True?  I guess I would feel better
seeing explicit examples for different combinations of settings (plus, how
would one set those for math operators?).  In this case, I have a problem
with this mixed situation.  I would think that IGNORE + NA = IGNORE, because
if you are skipping it, then it is skipped, regardless of the other side of
the operator.  (Precedent: a masked array summed against an array of NaNs.)
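
For the precedent I have in mind, here is a small sketch with today's
numpy.ma and plain NaNs.  It is only an analogy for IGNORE and NA; it says
nothing about how either NEP would actually define the mixed case.

```python
import numpy as np

# Position 1 is masked (IGNORE-like); the other array has NaN at 1 and 2.
m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
n = np.array([1.0, np.nan, np.nan])

r = m + n

masked_wins = r.mask.tolist()             # mask of the masked operand carries over
nan_survives = bool(np.isnan(r.data[2]))  # the unmasked NaN is still in the data
reduced = r.sum()                         # skips position 1, still sees the NaN at 2
```

Where the mask and the NaN coincide, the element stays masked; where only
the NaN is present, it propagates through the sum as usual.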

Looking back over Mark's NEP, I see he does cover the issue I am talking
about: "The design of this NEP does not distinguish between NAs that come
from an NA mask or NAs that come from an NA dtype. Both of these get treated
equivalently in computations, with masks dominating over NA dtypes".
However, he goes on to discuss the possibility of multi-NA support giving
more direct control over these effects.

Cheers,
Ben Root