[Numpy-discussion] new MaskedArray class

Mon Jun 24 11:46:02 EDT 2019

On 6/22/19 11:50 AM, Marten van Kerkwijk wrote:
> Hi Allan,
> 
> I'm not sure I would go too much by what the old MaskedArray class did. 
> It indeed made an effort not to overwrite masked values with a new 
> result, even to the extend of copying back masked input data elements to 
> the output data array after an operation. But the fact that this is 
> non-sensical if the dtype changes (or the units in an operation on 
> quantities) suggests that this mental model simply does not work.
> 
> I think a sensible alternative mental model for the MaskedArray class is 
> that all it does is forward any operations to the data it holds and 
> separately propagate a mask, ORing elements together for binary 
> operations, etc., and explicitly skipping masked elements in reductions 
> (ideally using `where` to be as agnostic as possible about the 
> underlying data, for which, e.g., setting masked values to `0` for 
> `np.reduce.add` may or may not be the right thing to do - what if they 
> are string? >
> With this mental picture, the underlying data are always have 
> well-defined meaning: they have been operated on as if the mask did not 
> exist. There then is also less reason to try to avoid getting it back to 
> the user.
> 
> As a concrete example (maybe Ben has others): in astropy we have a 
> sigma-clipping average routine, which uses a `MaskedArray` to 
> iteratively mask items that are too far off from the mean; here, the 
> mask varies each iteration (an initially masked element can come back 
> into play), but the data do not.
> 
> All the best,
> 
> Marten

I want to distinguish 3 different behaviors we are discussing:

 1. Making a "no-clobber" guarantee on the underlying data
 2. whether the data under the mask should have "sensible" values
 3. whether to allow "unmasking"

1. For "no clobber"
===================

I agree this would be a nice guarantee to make, as long as it does not
impose too much burden on us. Sometimes it is better not to chain
ourselves down.

By using "where", I can indeed make many api-functions and ufuncs
guarantee no-clobber. There are still a bunch of functions that seem
tricky to implement currently without either clobbering or making a
copy: dot, cross, outer, sort/argsort/searchsorted, correlate, convolve,
nonzero, and similar functions.

I'll think about how to implement those in a no-clobber way. Any
suggestions welcome, eg for np.dot/outer.

A no-clobber guarantee makes your "iterative mask" example solvable in
an efficient (no-copy) way:

    mask, last_mask = False
    while True:
        dat_mean = np.mean(MaskedArray(data, mask))
        mask, last_mask = np.abs(data - mask) > cutoff, mask
        if np.all(mask == last_mask):
            break

The MaskedArray constructor should have pretty minimal overhead.

2. Whether we can make "masked" data keep sensible values
=========================================================

I'm not confident this is a good guarantee to make. Certainly in a
simple case like c = a + b we can make masked values in c contain the
correct sum of the masked data in a and b. But I foresee complications
in other cases. For instance,

    MaskedArray([4,4,4])/MaskedArray([0, 1, 2], mask=[1,0,1])

If we use the "where" ufunc argument to evaluate the division operation,
a division-by-0 warning is not output which is good. However, if we use
"where" index 2 does not get updated correctly and will contain
"nonsense". If we use "where" a lot (which I think we should) we can
expect a lot of uninitialized masked values to commonly appear.

So my interpretation is that this comment:

> I think a sensible alternative mental model for the MaskedArray class
is that all it does is forward any operations to the data it holds and
separately propagate a mask

should only roughly be true, in the sense that we will not simply
"forward any operations" but we will also use "where" arguments which
produce nonsense masked values.

3. Whether to allow unmasking
=============================

If we agree that masked values will contain nonsense, it seems like a
bad idea for those values to be easily exposed.

Further, in all the comments so far I have not seen an example of a need
for unmasking that is not more easily, efficiently and safely achieved
by simply creating a new MaskedArray with a different mask.

If super-users want to access the ._data attribute they can, but I don't
think it should be recommended.

Normal users can use the ".filled" method, which by the way I
implemented to optionally support returning a readonly view rather than
a copy (the default).

Cheers,
Allan

> 
> On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhaldane at gmail.com 
> <mailto:allanhaldane at gmail.com>> wrote:
> 
>     On 6/21/19 2:37 PM, Benjamin Root wrote:
>      > Just to note, data that is masked isn't always garbage. There are
>     plenty
>      > of use-cases where one may want to temporarily apply a mask for a
>     set of
>      > computation, or possibly want to apply a series of different masks to
>      > the data. I haven't read through this discussion deeply enough,
>     but is
>      > this new class going to destroy underlying masked data? and will
>     it be
>      > possible to swap out masks?
>      >
>      > Cheers!
>      > Ben Root
> 
>     Indeed my implementation currently feels free to clobber the data at
>     masked positions and makes no guarantees not to.
> 
>     I'd like to try to support reasonable use-cases like yours though. A
>     few
>     thoughts:
> 
>     First, the old np.ma.MaskedArray explicitly does not promise to
>     preserve
>     masked values, with a big warning in the docs. I can't recall the
>     examples, but I remember coming across cases where clobbering happens.
>     So arguably your behavior was never supported, and perhaps this means
>     that no-clobber behavior is difficult to reasonably support.
> 
>     Second, the old np.ma.MaskedArray avoids frequent clobbering by making
>     lots of copies. Therefore, in most cases you will not lose any
>     performance in my new MaskedArray relative to the old one by making an
>     explicit copy yourself. I.e, is it problematic to have to do
> 
>           >>> result = MaskedArray(data.copy(), trial_mask).sum()
> 
>     instead of
> 
>           >>> marr.mask = trial_mask
>           >>> result = marr.sum()
> 
>     since they have similar performance?
> 
>     Third, in the old np.ma.MaskedArray masked positions are very often
>     "effectively" clobbered, in the sense that they are not computed. For
>     example, if you do "c = a+b", and then change the mask of c, the values
>     at masked position of the result of (a+b) do not correspond to the sum
>     of the masked values in a and b. Thus, by "unmasking" c you are
>     exposing
>     nonsense values, which to me seems likely to cause heisenbugs.
> 
> 
>     In summary, by not making no-clobber guarantees and by strictly
>     preventing exposure of nonsense values, I suspect that: 1. my new code
>     is simpler and faster by avoiding lots of copies, and forces copies to
>     be explicit in user code. 2. disallowing direct modification of the
>     mask
>     lowers the "API surface area" making people's MaskedArray code less
>     buggy and easier to read: Exposure of nonsense values by "unmasking" is
>     one less possibility to keep in mind.
> 
>     Best,
>     Allan
> 
> 
>      > On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane
>     <allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>
>      > <mailto:allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>>>
>     wrote:
>      >
>      >     On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
>      >     > Hi Allan,
>      >     >
>      >     > This is very impressive! I could get the tests that I wrote
>     for my
>      >     class
>      >     > pass with yours using Quantity with what I would consider
>     very minimal
>      >     > changes. I only could not find a good way to unmask data (I
>     like the
>      >     > idea of setting the mask on some elements via `ma[item] =
>     X`); is this
>      >     > on purpose?
>      >
>      >     Yes, I want to make it difficult for the user to access the
>     garbage
>      >     values under the mask, which are often clobbered values. The
>     only way to
>      >     "remove" a masked value is by replacing it with a new
>     non-masked value.
>      >
>      >
>      >     > Anyway, it would seem easily at the point where I should
>     comment
>      >     on your
>      >     > repository rather than in the mailing list!
>      >
>      >     To make further progress on this encapsulation idea I need a more
>      >     complete ducktype to pass into MaskedArray to test, so that's
>     what I'll
>      >     work on next, when I have time. I'll either try to finish my
>      >     ArrayCollection type, or try making a simple NDunit ducktype
>      >     piggybacking on astropy's Unit.
>      >
>      >     Best,
>      >     Allan
>      >
>      >
>      >     >
>      >     > All the best,
>      >     >
>      >     > Marten
>      >     >
>      >     >
>      >     > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
>      >     <allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>
>     <mailto:allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>>
>      >     > <mailto:allanhaldane at gmail.com
>     <mailto:allanhaldane at gmail.com> <mailto:allanhaldane at gmail.com
>     <mailto:allanhaldane at gmail.com>>>>
>      >     wrote:
>      >     >
>      >     >     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>      >     >     >
>      >     >     >
>      >     >     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>      >     >     <allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>
>     <mailto:allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>>
>      >     <mailto:allanhaldane at gmail.com
>     <mailto:allanhaldane at gmail.com> <mailto:allanhaldane at gmail.com
>     <mailto:allanhaldane at gmail.com>>>
>      >     >     > <mailto:allanhaldane at gmail.com
>     <mailto:allanhaldane at gmail.com>
>      >     <mailto:allanhaldane at gmail.com
>     <mailto:allanhaldane at gmail.com>> <mailto:allanhaldane at gmail.com
>     <mailto:allanhaldane at gmail.com>
>      >     <mailto:allanhaldane at gmail.com
>     <mailto:allanhaldane at gmail.com>>>>>
>      >     >     wrote:
>      >     >     > <snip>
>      >     >     >
>      >     >     >     > This may be too much to ask from the
>     initializer, but, if
>      >     >     so, it still
>      >     >     >     > seems most useful if it is made as easy as
>     possible to do,
>      >     >     say, `class
>      >     >     >     > MaskedQuantity(Masked, Quantity): <very few
>     overrides>`.
>      >     >     >
>      >     >     >     Currently MaskedArray does not accept ducktypes as
>      >     underlying
>      >     >     arrays,
>      >     >     >     but I think it shouldn't be too hard to modify it
>     to do so.
>      >     >     Good idea!
>      >     >     >
>      >     >     >
>      >     >     > Looking back at my trial, I see that I also never got to
>      >     duck arrays -
>      >     >     > only ndarray subclasses - though I tried to make the
>     code as
>      >     >     agnostic as
>      >     >     > possible.
>      >     >     >
>      >     >     > (Trial at
>      >     >     >
>      >     >
>      >
>     https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1)
>      >     >     >
>      >     >     >     I already partly navigated this mixin-issue in the
>      >     >     >     "MaskedArrayCollection" class, which essentially does
>      >     >     >     ArrayCollection(MaskedArray(array)), and only
>     takes about 30
>      >     >     lines of
>      >     >     >     boilerplate. That's the backwards encapsulation
>     order from
>      >     >     what you want
>      >     >     >     though.
>      >     >     >
>      >     >     >
>      >     >     > Yes, indeed, from a quick trial
>     `MaskedArray(np.arange(3.) *
>      >     u.m,
>      >     >     > mask=[True, False, False])` does indeed not have a
>     `.unit`
>      >     attribute
>      >     >     > (and cannot represent itself...); I'm not at all sure
>     that my
>      >     >     method of
>      >     >     > just creating a mixed class is anything but a recipe for
>      >     disaster,
>      >     >     though!
>      >     >
>      >     >     Based on your suggestion I worked on this a little
>     today, and
>      >     now my
>      >     >     MaskedArray more easily encapsulates both ducktypes and
>     ndarray
>      >     >     subclasses (pushed to repo). Here's an example I got
>     working
>      >     with masked
>      >     >     units using unyt:
>      >     >
>      >     >     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
>      >     >
>      >     >     [2]: from unyt import m, km
>      >     >
>      >     >     [3]: import numpy as np
>      >     >
>      >     >     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
>      >     >
>      >     >     [5]: uarr
>      >     >
>      >     >     MaskedArray([1., X , 3.])
>      >     >     [6]: uarr + 1*m
>      >     >
>      >     >     MaskedArray([1.001, X    , 3.001])
>      >     >     [7]: uarr.filled()
>      >     >
>      >     >     unyt_array([1., 0., 3.], 'km')
>      >     >     [8]: np.concatenate([uarr, 2*uarr]).filled()
>      >     >     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
>      >     >
>      >     >     The catch is the ducktype/subclass has to rigorously follow
>      >     numpy's
>      >     >     indexing rules, including distinguishing 0d arrays from
>      >     scalars. For now
>      >     >     only I used unyt in the example above since it happens
>     to be
>      >     less strict
>      >     >      about dimensionless operations than astropy.units
>     which trips
>      >     up my
>      >     >     repr code. (see below for example with astropy.units).
>     Note in
>      >     the last
>      >     >     line I lost the dimensions, but that is because unyt
>     does not
>      >     handle
>      >     >     np.concatenate. To get that to work we need a true ducktype
>      >     for units.
>      >     >
>      >     >     The example above doesn't expose the ".units" attribute
>      >     outside the
>      >     >     MaskedArray, and it doesn't print the units in the
>     repr. But
>      >     you can
>      >     >     access them using "filled".
>      >     >
>      >     >     While I could make MaskedArray forward unknown attribute
>      >     accesses to the
>      >     >     encapsulated array, that seems a bit
>     dangerous/bug-prone at first
>      >     >     glance, so probably I want to require the user to make a
>      >     MaskedArray
>      >     >     subclass to do so. I've just started playing with that
>      >     (probably buggy),
>      >     >     and Ive attached subclass examples for astropy.unit and
>     unyt,
>      >     with some
>      >     >     example output below.
>      >     >
>      >     >     Cheers,
>      >     >     Allan
>      >     >
>      >     >
>      >     >
>      >     >     Example using the attached astropy unit subclass:
>      >     >
>      >     >         >>> from astropy.units import m, km, s
>      >     >         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
>      >     >         >>> uarr
>      >     >         MaskedQ([1., X , 1.], units=km)
>      >     >         >>> uarr.units
>      >     >         km
>      >     >         >>> uarr + (1*m)
>      >     >         MaskedQ([1.001, X    , 1.001], units=km)
>      >     >         >>> uarr/(1*s)
>      >     >         MaskedQ([1., X , 1.], units=km / s)
>      >     >         >>> (uarr*(1*m))[1:]
>      >     >         MaskedQ([X , 1.], units=km m)
>      >     >         >>> np.add.outer(uarr, uarr)
>      >     >         MaskedQ([[2., X , 2.],
>      >     >                  [X , X , X ],
>      >     >                  [2., X , 2.]], units=km)
>      >     >         >>> print(uarr)
>      >     >         [1. X  1.] km m
>      >     >
>      >     >     Cheers,
>      >     >     Allan
>      >     >
>      >     >
>      >     >     >     > Even if this impossible, I think it is
>     conceptually useful
>      >     >     to think
>      >     >     >     > about what the masking class should do. My
>     sense is that,
>      >     >     e.g., it
>      >     >     >     > should not attempt to decide when an operation
>      >     succeeds or not,
>      >     >     >     but just
>      >     >     >     > "or together" input masks for regular,
>     multiple-input
>      >     functions,
>      >     >     >     and let
>      >     >     >     > the underlying arrays skip elements for
>     reductions by
>      >     using
>      >     >     `where`
>      >     >     >     > (hey, I did implement that for a reason... ;-). In
>      >     >     particular, it
>      >     >     >     > suggests one should not have things like
>     domains and all
>      >     >     that (I never
>      >     >     >     > understood why `MaskedArray` did that). If one
>     wants more,
>      >     >     the class
>      >     >     >     > should provide a method that updates the mask
>     (a sensible
>      >     >     default
>      >     >     >     might
>      >     >     >     > be `mask |= ~np.isfinite(result)` - here, the class
>      >     being masked
>      >     >     >     should
>      >     >     >     > logically support ufuncs and functions, so it can
>      >     decide what
>      >     >     >     "isfinite"
>      >     >     >     > means).
>      >     >     >
>      >     >     >     I agree it would be nice to remove domains. It would
>      >     make life
>      >     >     easier,
>      >     >     >     and I could remove a lot of twiddly code! I kept
>     it in
>      >     for now to
>      >     >     >     minimize the behavior changes from the old
>     MaskedArray.
>      >     >     >
>      >     >     >
>      >     >     > That makes sense. Could be separated out to a
>      >     backwards-compatibility
>      >     >     > class later.
>      >     >     >
>      >     >     >
>      >     >     >     > In any case, I would think that a basic truth
>     should
>      >     be that
>      >     >     >     everything
>      >     >     >     > has a mask with a shape consistent with the
>     data, so
>      >     >     >     > 1. Each complex numbers has just one mask, and
>     setting
>      >     >     `a.imag` with a
>      >     >     >     > masked array should definitely propagate the mask.
>      >     >     >     > 2. For a masked array with structured dtype, I'd
>      >     similarly say
>      >     >     >     that the
>      >     >     >     > default is for a mask to have the same shape as
>     the array.
>      >     >     But that
>      >     >     >     > something like your collection makes sense for
>     the case
>      >     >     where one
>      >     >     >     wants
>      >     >     >     > to mask items in a structure.
>      >     >     >
>      >     >     >     Agreed that we should have a single bool per
>     complex or
>      >     structured
>      >     >     >     element, and the mask shape is the same as the
>     array shape.
>      >     >     That's how I
>      >     >     >     implemented it. But there is still a problem with
>      >     complex.imag
>      >     >     >     assignment:
>      >     >     >
>      >     >     >         >>> a = MaskedArray([1j, 2, X])
>      >     >     >         >>> i = a.imag
>      >     >     >         >>> i[:] = MaskedArray([1, X, 1])
>      >     >     >
>      >     >     >     If we make the last line copy the mask to the
>     original
>      >     array, what
>      >     >     >     should the real part of a[2] be? Conversely, if
>     we don't
>      >     copy
>      >     >     the mask,
>      >     >     >     what should the imag part of a[1] be? It seems
>     like we might
>      >     >     "want" the
>      >     >     >     masks to be OR'd instead, but then should i[2] be
>     masked
>      >     after
>      >     >     we just
>      >     >     >     set it to 1?
>      >     >     >
>      >     >     > Ah, I see the issue now... Easiest to implement and
>     closest
>      >     in analogy
>      >     >     > to a regular view would be to just let it unmask a[2]
>     (with
>      >     >     whatever is
>      >     >     > in real; user beware!).
>      >     >     >
>      >     >     > Perhaps better would be to special-case such that `imag`
>      >     returns a
>      >     >     > read-only view of the mask. Making `imag` itself
>     read-only would
>      >     >     prevent
>      >     >     > possibly reasonable things like `i[np.isclose(i, 0)]
>     = 0` - but
>      >     >     there is
>      >     >     > no reason this should update the mask.
>      >     >     >
>      >     >     > Still, neither is really satisfactory...
>      >     >     >
>      >     >     >
>      >     >     >
>      >     >     >     > p.s. I started trying to implement the above
>     "Mixin"
>      >     class; will
>      >     >     >     try to
>      >     >     >     > clean that up a bit so that at least it uses
>     `where` and
>      >     >     push it up.
>      >     >     >
>      >     >     >     I played with "where", but didn't include it
>     since 1.17
>      >     is not
>      >     >     released.
>      >     >     >     To avoid duplication of effort, I've attached a
>     diff of
>      >     what I
>      >     >     tried. I
>      >     >     >     actually get a slight slowdown of about 10% by using
>      >     where...
>      >     >     >
>      >     >     >
>      >     >     > Your implementation is indeed quite similar to what I
>     got in
>      >     >     > __array_ufunc__ (though one should "&" the where with
>     ~mask).
>      >     >     >
>      >     >     > I think the main benefit is not to presume that
>     whatever is
>      >     underneath
>      >     >     > understands 0 or 1, i.e., avoid filling.
>      >     >     >
>      >     >     >
>      >     >     >     If you make progress with the mixin, a push is
>     welcome. I
>      >     >     imagine a
>      >     >     >     problem is going to be that np.isscalar doesn't
>     work to
>      >     detect
>      >     >     duck
>      >     >     >     scalars.
>      >     >     >
>      >     >     > I fear that in my attempts I've simply decided that only
>      >     array scalars
>      >     >     > exist...
>      >     >     >
>      >     >     > -- Marten
>      >     >     >
>      >     >     > _______________________________________________
>      >     >     > NumPy-Discussion mailing list
>      >     >     > NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>
>      >     <mailto:NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>>
>      >     <mailto:NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>
>      >     <mailto:NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>>>
>      >     >     > https://mail.python.org/mailman/listinfo/numpy-discussion
>      >     >     >
>      >     >
>      >     >     _______________________________________________
>      >     >     NumPy-Discussion mailing list
>      >     > NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>
>      >     <mailto:NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>>
>      >     <mailto:NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>
>      >     <mailto:NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>>>
>      >     > https://mail.python.org/mailman/listinfo/numpy-discussion
>      >     >
>      >     >
>      >     > _______________________________________________
>      >     > NumPy-Discussion mailing list
>      >     > NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>
>     <mailto:NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>>
>      >     > https://mail.python.org/mailman/listinfo/numpy-discussion
>      >     >
>      >
>      >     _______________________________________________
>      >     NumPy-Discussion mailing list
>      > NumPy-Discussion at python.org <mailto:NumPy-Discussion at python.org>
>     <mailto:NumPy-Discussion at python.org
>     <mailto:NumPy-Discussion at python.org>>
>      > https://mail.python.org/mailman/listinfo/numpy-discussion
>      >
>      >
>      > _______________________________________________
>      > NumPy-Discussion mailing list
>      > NumPy-Discussion at python.org <mailto:NumPy-Discussion at python.org>
>      > https://mail.python.org/mailman/listinfo/numpy-discussion
>      >
> 
>     _______________________________________________
>     NumPy-Discussion mailing list
>     NumPy-Discussion at python.org <mailto:NumPy-Discussion at python.org>
>     https://mail.python.org/mailman/listinfo/numpy-discussion
> 
> 
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>