[Numpy-discussion] new MaskedArray class

Marten van Kerkwijk m.h.vankerkwijk at gmail.com
Sat Jun 22 11:50:40 EDT 2019


Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It
indeed made an effort not to overwrite masked values with a new result,
even to the extend of copying back masked input data elements to the output
data array after an operation. But the fact that this is non-sensical if
the dtype changes (or the units in an operation on quantities) suggests
that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is
that all it does is forward any operations to the data it holds and
separately propagate a mask, ORing elements together for binary operations,
etc., and explicitly skipping masked elements in reductions (ideally using
`where` to be as agnostic as possible about the underlying data, for which,
e.g., setting masked values to `0` for `np.reduce.add` may or may not be
the right thing to do - what if they are string?).

With this mental picture, the underlying data are always have well-defined
meaning: they have been operated on as if the mask did not exist. There
then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a
sigma-clipping average routine, which uses a `MaskedArray` to iteratively
mask items that are too far off from the mean; here, the mask varies each
iteration (an initially masked element can come back into play), but the
data do not.

All the best,

Marten

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhaldane at gmail.com>
wrote:

> On 6/21/19 2:37 PM, Benjamin Root wrote:
> > Just to note, data that is masked isn't always garbage. There are plenty
> > of use-cases where one may want to temporarily apply a mask for a set of
> > computation, or possibly want to apply a series of different masks to
> > the data. I haven't read through this discussion deeply enough, but is
> > this new class going to destroy underlying masked data? and will it be
> > possible to swap out masks?
> >
> > Cheers!
> > Ben Root
>
> Indeed my implementation currently feels free to clobber the data at
> masked positions and makes no guarantees not to.
>
> I'd like to try to support reasonable use-cases like yours though. A few
> thoughts:
>
> First, the old np.ma.MaskedArray explicitly does not promise to preserve
> masked values, with a big warning in the docs. I can't recall the
> examples, but I remember coming across cases where clobbering happens.
> So arguably your behavior was never supported, and perhaps this means
> that no-clobber behavior is difficult to reasonably support.
>
> Second, the old np.ma.MaskedArray avoids frequent clobbering by making
> lots of copies. Therefore, in most cases you will not lose any
> performance in my new MaskedArray relative to the old one by making an
> explicit copy yourself. I.e, is it problematic to have to do
>
>      >>> result = MaskedArray(data.copy(), trial_mask).sum()
>
> instead of
>
>      >>> marr.mask = trial_mask
>      >>> result = marr.sum()
>
> since they have similar performance?
>
> Third, in the old np.ma.MaskedArray masked positions are very often
> "effectively" clobbered, in the sense that they are not computed. For
> example, if you do "c = a+b", and then change the mask of c, the values
> at masked position of the result of (a+b) do not correspond to the sum
> of the masked values in a and b. Thus, by "unmasking" c you are exposing
> nonsense values, which to me seems likely to cause heisenbugs.
>
>
> In summary, by not making no-clobber guarantees and by strictly
> preventing exposure of nonsense values, I suspect that: 1. my new code
> is simpler and faster by avoiding lots of copies, and forces copies to
> be explicit in user code. 2. disallowing direct modification of the mask
> lowers the "API surface area" making people's MaskedArray code less
> buggy and easier to read: Exposure of nonsense values by "unmasking" is
> one less possibility to keep in mind.
>
> Best,
> Allan
>
>
> > On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane at gmail.com
> > <mailto:allanhaldane at gmail.com>> wrote:
> >
> >     On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> >     > Hi Allan,
> >     >
> >     > This is very impressive! I could get the tests that I wrote for my
> >     class
> >     > pass with yours using Quantity with what I would consider very
> minimal
> >     > changes. I only could not find a good way to unmask data (I like
> the
> >     > idea of setting the mask on some elements via `ma[item] = X`); is
> this
> >     > on purpose?
> >
> >     Yes, I want to make it difficult for the user to access the garbage
> >     values under the mask, which are often clobbered values. The only
> way to
> >     "remove" a masked value is by replacing it with a new non-masked
> value.
> >
> >
> >     > Anyway, it would seem easily at the point where I should comment
> >     on your
> >     > repository rather than in the mailing list!
> >
> >     To make further progress on this encapsulation idea I need a more
> >     complete ducktype to pass into MaskedArray to test, so that's what
> I'll
> >     work on next, when I have time. I'll either try to finish my
> >     ArrayCollection type, or try making a simple NDunit ducktype
> >     piggybacking on astropy's Unit.
> >
> >     Best,
> >     Allan
> >
> >
> >     >
> >     > All the best,
> >     >
> >     > Marten
> >     >
> >     >
> >     > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
> >     <allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>
> >     > <mailto:allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>>>
> >     wrote:
> >     >
> >     >     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
> >     >     >
> >     >     >
> >     >     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
> >     >     <allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>
> >     <mailto:allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>>
> >     >     > <mailto:allanhaldane at gmail.com
> >     <mailto:allanhaldane at gmail.com> <mailto:allanhaldane at gmail.com
> >     <mailto:allanhaldane at gmail.com>>>>
> >     >     wrote:
> >     >     > <snip>
> >     >     >
> >     >     >     > This may be too much to ask from the initializer, but,
> if
> >     >     so, it still
> >     >     >     > seems most useful if it is made as easy as possible to
> do,
> >     >     say, `class
> >     >     >     > MaskedQuantity(Masked, Quantity): <very few
> overrides>`.
> >     >     >
> >     >     >     Currently MaskedArray does not accept ducktypes as
> >     underlying
> >     >     arrays,
> >     >     >     but I think it shouldn't be too hard to modify it to do
> so.
> >     >     Good idea!
> >     >     >
> >     >     >
> >     >     > Looking back at my trial, I see that I also never got to
> >     duck arrays -
> >     >     > only ndarray subclasses - though I tried to make the code as
> >     >     agnostic as
> >     >     > possible.
> >     >     >
> >     >     > (Trial at
> >     >     >
> >     >
> >
> https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1
> )
> >     >     >
> >     >     >     I already partly navigated this mixin-issue in the
> >     >     >     "MaskedArrayCollection" class, which essentially does
> >     >     >     ArrayCollection(MaskedArray(array)), and only takes
> about 30
> >     >     lines of
> >     >     >     boilerplate. That's the backwards encapsulation order
> from
> >     >     what you want
> >     >     >     though.
> >     >     >
> >     >     >
> >     >     > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) *
> >     u.m,
> >     >     > mask=[True, False, False])` does indeed not have a `.unit`
> >     attribute
> >     >     > (and cannot represent itself...); I'm not at all sure that my
> >     >     method of
> >     >     > just creating a mixed class is anything but a recipe for
> >     disaster,
> >     >     though!
> >     >
> >     >     Based on your suggestion I worked on this a little today, and
> >     now my
> >     >     MaskedArray more easily encapsulates both ducktypes and ndarray
> >     >     subclasses (pushed to repo). Here's an example I got working
> >     with masked
> >     >     units using unyt:
> >     >
> >     >     [1]: from MaskedArray import X, MaskedArray, MaskedScalar
> >     >
> >     >     [2]: from unyt import m, km
> >     >
> >     >     [3]: import numpy as np
> >     >
> >     >     [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
> >     >
> >     >     [5]: uarr
> >     >
> >     >     MaskedArray([1., X , 3.])
> >     >     [6]: uarr + 1*m
> >     >
> >     >     MaskedArray([1.001, X    , 3.001])
> >     >     [7]: uarr.filled()
> >     >
> >     >     unyt_array([1., 0., 3.], 'km')
> >     >     [8]: np.concatenate([uarr, 2*uarr]).filled()
> >     >     unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
> >     >
> >     >     The catch is the ducktype/subclass has to rigorously follow
> >     numpy's
> >     >     indexing rules, including distinguishing 0d arrays from
> >     scalars. For now
> >     >     only I used unyt in the example above since it happens to be
> >     less strict
> >     >      about dimensionless operations than astropy.units which trips
> >     up my
> >     >     repr code. (see below for example with astropy.units). Note in
> >     the last
> >     >     line I lost the dimensions, but that is because unyt does not
> >     handle
> >     >     np.concatenate. To get that to work we need a true ducktype
> >     for units.
> >     >
> >     >     The example above doesn't expose the ".units" attribute
> >     outside the
> >     >     MaskedArray, and it doesn't print the units in the repr. But
> >     you can
> >     >     access them using "filled".
> >     >
> >     >     While I could make MaskedArray forward unknown attribute
> >     accesses to the
> >     >     encapsulated array, that seems a bit dangerous/bug-prone at
> first
> >     >     glance, so probably I want to require the user to make a
> >     MaskedArray
> >     >     subclass to do so. I've just started playing with that
> >     (probably buggy),
> >     >     and Ive attached subclass examples for astropy.unit and unyt,
> >     with some
> >     >     example output below.
> >     >
> >     >     Cheers,
> >     >     Allan
> >     >
> >     >
> >     >
> >     >     Example using the attached astropy unit subclass:
> >     >
> >     >         >>> from astropy.units import m, km, s
> >     >         >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
> >     >         >>> uarr
> >     >         MaskedQ([1., X , 1.], units=km)
> >     >         >>> uarr.units
> >     >         km
> >     >         >>> uarr + (1*m)
> >     >         MaskedQ([1.001, X    , 1.001], units=km)
> >     >         >>> uarr/(1*s)
> >     >         MaskedQ([1., X , 1.], units=km / s)
> >     >         >>> (uarr*(1*m))[1:]
> >     >         MaskedQ([X , 1.], units=km m)
> >     >         >>> np.add.outer(uarr, uarr)
> >     >         MaskedQ([[2., X , 2.],
> >     >                  [X , X , X ],
> >     >                  [2., X , 2.]], units=km)
> >     >         >>> print(uarr)
> >     >         [1. X  1.] km m
> >     >
> >     >     Cheers,
> >     >     Allan
> >     >
> >     >
> >     >     >     > Even if this impossible, I think it is conceptually
> useful
> >     >     to think
> >     >     >     > about what the masking class should do. My sense is
> that,
> >     >     e.g., it
> >     >     >     > should not attempt to decide when an operation
> >     succeeds or not,
> >     >     >     but just
> >     >     >     > "or together" input masks for regular, multiple-input
> >     functions,
> >     >     >     and let
> >     >     >     > the underlying arrays skip elements for reductions by
> >     using
> >     >     `where`
> >     >     >     > (hey, I did implement that for a reason... ;-). In
> >     >     particular, it
> >     >     >     > suggests one should not have things like domains and
> all
> >     >     that (I never
> >     >     >     > understood why `MaskedArray` did that). If one wants
> more,
> >     >     the class
> >     >     >     > should provide a method that updates the mask (a
> sensible
> >     >     default
> >     >     >     might
> >     >     >     > be `mask |= ~np.isfinite(result)` - here, the class
> >     being masked
> >     >     >     should
> >     >     >     > logically support ufuncs and functions, so it can
> >     decide what
> >     >     >     "isfinite"
> >     >     >     > means).
> >     >     >
> >     >     >     I agree it would be nice to remove domains. It would
> >     make life
> >     >     easier,
> >     >     >     and I could remove a lot of twiddly code! I kept it in
> >     for now to
> >     >     >     minimize the behavior changes from the old MaskedArray.
> >     >     >
> >     >     >
> >     >     > That makes sense. Could be separated out to a
> >     backwards-compatibility
> >     >     > class later.
> >     >     >
> >     >     >
> >     >     >     > In any case, I would think that a basic truth should
> >     be that
> >     >     >     everything
> >     >     >     > has a mask with a shape consistent with the data, so
> >     >     >     > 1. Each complex numbers has just one mask, and setting
> >     >     `a.imag` with a
> >     >     >     > masked array should definitely propagate the mask.
> >     >     >     > 2. For a masked array with structured dtype, I'd
> >     similarly say
> >     >     >     that the
> >     >     >     > default is for a mask to have the same shape as the
> array.
> >     >     But that
> >     >     >     > something like your collection makes sense for the case
> >     >     where one
> >     >     >     wants
> >     >     >     > to mask items in a structure.
> >     >     >
> >     >     >     Agreed that we should have a single bool per complex or
> >     structured
> >     >     >     element, and the mask shape is the same as the array
> shape.
> >     >     That's how I
> >     >     >     implemented it. But there is still a problem with
> >     complex.imag
> >     >     >     assignment:
> >     >     >
> >     >     >         >>> a = MaskedArray([1j, 2, X])
> >     >     >         >>> i = a.imag
> >     >     >         >>> i[:] = MaskedArray([1, X, 1])
> >     >     >
> >     >     >     If we make the last line copy the mask to the original
> >     array, what
> >     >     >     should the real part of a[2] be? Conversely, if we don't
> >     copy
> >     >     the mask,
> >     >     >     what should the imag part of a[1] be? It seems like we
> might
> >     >     "want" the
> >     >     >     masks to be OR'd instead, but then should i[2] be masked
> >     after
> >     >     we just
> >     >     >     set it to 1?
> >     >     >
> >     >     > Ah, I see the issue now... Easiest to implement and closest
> >     in analogy
> >     >     > to a regular view would be to just let it unmask a[2] (with
> >     >     whatever is
> >     >     > in real; user beware!).
> >     >     >
> >     >     > Perhaps better would be to special-case such that `imag`
> >     returns a
> >     >     > read-only view of the mask. Making `imag` itself read-only
> would
> >     >     prevent
> >     >     > possibly reasonable things like `i[np.isclose(i, 0)] = 0` -
> but
> >     >     there is
> >     >     > no reason this should update the mask.
> >     >     >
> >     >     > Still, neither is really satisfactory...
> >     >     >
> >     >     >
> >     >     >
> >     >     >     > p.s. I started trying to implement the above "Mixin"
> >     class; will
> >     >     >     try to
> >     >     >     > clean that up a bit so that at least it uses `where`
> and
> >     >     push it up.
> >     >     >
> >     >     >     I played with "where", but didn't include it since 1.17
> >     is not
> >     >     released.
> >     >     >     To avoid duplication of effort, I've attached a diff of
> >     what I
> >     >     tried. I
> >     >     >     actually get a slight slowdown of about 10% by using
> >     where...
> >     >     >
> >     >     >
> >     >     > Your implementation is indeed quite similar to what I got in
> >     >     > __array_ufunc__ (though one should "&" the where with ~mask).
> >     >     >
> >     >     > I think the main benefit is not to presume that whatever is
> >     underneath
> >     >     > understands 0 or 1, i.e., avoid filling.
> >     >     >
> >     >     >
> >     >     >     If you make progress with the mixin, a push is welcome. I
> >     >     imagine a
> >     >     >     problem is going to be that np.isscalar doesn't work to
> >     detect
> >     >     duck
> >     >     >     scalars.
> >     >     >
> >     >     > I fear that in my attempts I've simply decided that only
> >     array scalars
> >     >     > exist...
> >     >     >
> >     >     > -- Marten
> >     >     >
> >     >     > _______________________________________________
> >     >     > NumPy-Discussion mailing list
> >     >     > NumPy-Discussion at python.org
> >     <mailto:NumPy-Discussion at python.org>
> >     <mailto:NumPy-Discussion at python.org
> >     <mailto:NumPy-Discussion at python.org>>
> >     >     > https://mail.python.org/mailman/listinfo/numpy-discussion
> >     >     >
> >     >
> >     >     _______________________________________________
> >     >     NumPy-Discussion mailing list
> >     >     NumPy-Discussion at python.org
> >     <mailto:NumPy-Discussion at python.org>
> >     <mailto:NumPy-Discussion at python.org
> >     <mailto:NumPy-Discussion at python.org>>
> >     >     https://mail.python.org/mailman/listinfo/numpy-discussion
> >     >
> >     >
> >     > _______________________________________________
> >     > NumPy-Discussion mailing list
> >     > NumPy-Discussion at python.org <mailto:NumPy-Discussion at python.org>
> >     > https://mail.python.org/mailman/listinfo/numpy-discussion
> >     >
> >
> >     _______________________________________________
> >     NumPy-Discussion mailing list
> >     NumPy-Discussion at python.org <mailto:NumPy-Discussion at python.org>
> >     https://mail.python.org/mailman/listinfo/numpy-discussion
> >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20190622/09795243/attachment-0001.html>


More information about the NumPy-Discussion mailing list