[Numpy-discussion] new MaskedArray class
Marten van Kerkwijk
m.h.vankerkwijk at gmail.com
Sat Jun 22 11:50:40 EDT 2019
Hi Allan,
I'm not sure I would go too much by what the old MaskedArray class did. It
indeed made an effort not to overwrite masked values with a new result,
even to the extend of copying back masked input data elements to the output
data array after an operation. But the fact that this is non-sensical if
the dtype changes (or the units in an operation on quantities) suggests
that this mental model simply does not work.
I think a sensible alternative mental model for the MaskedArray class is
that all it does is forward any operations to the data it holds and
separately propagate a mask, ORing elements together for binary operations,
etc., and explicitly skipping masked elements in reductions (ideally using
`where` to be as agnostic as possible about the underlying data, for which,
e.g., setting masked values to `0` for `np.reduce.add` may or may not be
the right thing to do - what if they are string?).
With this mental picture, the underlying data are always have well-defined
meaning: they have been operated on as if the mask did not exist. There
then is also less reason to try to avoid getting it back to the user.
As a concrete example (maybe Ben has others): in astropy we have a
sigma-clipping average routine, which uses a `MaskedArray` to iteratively
mask items that are too far off from the mean; here, the mask varies each
iteration (an initially masked element can come back into play), but the
data do not.
All the best,
Marten
On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhaldane at gmail.com>
wrote:
> On 6/21/19 2:37 PM, Benjamin Root wrote:
> > Just to note, data that is masked isn't always garbage. There are plenty
> > of use-cases where one may want to temporarily apply a mask for a set of
> > computation, or possibly want to apply a series of different masks to
> > the data. I haven't read through this discussion deeply enough, but is
> > this new class going to destroy underlying masked data? and will it be
> > possible to swap out masks?
> >
> > Cheers!
> > Ben Root
>
> Indeed my implementation currently feels free to clobber the data at
> masked positions and makes no guarantees not to.
>
> I'd like to try to support reasonable use-cases like yours though. A few
> thoughts:
>
> First, the old np.ma.MaskedArray explicitly does not promise to preserve
> masked values, with a big warning in the docs. I can't recall the
> examples, but I remember coming across cases where clobbering happens.
> So arguably your behavior was never supported, and perhaps this means
> that no-clobber behavior is difficult to reasonably support.
>
> Second, the old np.ma.MaskedArray avoids frequent clobbering by making
> lots of copies. Therefore, in most cases you will not lose any
> performance in my new MaskedArray relative to the old one by making an
> explicit copy yourself. I.e, is it problematic to have to do
>
> >>> result = MaskedArray(data.copy(), trial_mask).sum()
>
> instead of
>
> >>> marr.mask = trial_mask
> >>> result = marr.sum()
>
> since they have similar performance?
>
> Third, in the old np.ma.MaskedArray masked positions are very often
> "effectively" clobbered, in the sense that they are not computed. For
> example, if you do "c = a+b", and then change the mask of c, the values
> at masked position of the result of (a+b) do not correspond to the sum
> of the masked values in a and b. Thus, by "unmasking" c you are exposing
> nonsense values, which to me seems likely to cause heisenbugs.
>
>
> In summary, by not making no-clobber guarantees and by strictly
> preventing exposure of nonsense values, I suspect that: 1. my new code
> is simpler and faster by avoiding lots of copies, and forces copies to
> be explicit in user code. 2. disallowing direct modification of the mask
> lowers the "API surface area" making people's MaskedArray code less
> buggy and easier to read: Exposure of nonsense values by "unmasking" is
> one less possibility to keep in mind.
>
> Best,
> Allan
>
>
> > On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane at gmail.com
> > <mailto:allanhaldane at gmail.com>> wrote:
> >
> > On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> > > Hi Allan,
> > >
> > > This is very impressive! I could get the tests that I wrote for my
> > class
> > > pass with yours using Quantity with what I would consider very
> minimal
> > > changes. I only could not find a good way to unmask data (I like
> the
> > > idea of setting the mask on some elements via `ma[item] = X`); is
> this
> > > on purpose?
> >
> > Yes, I want to make it difficult for the user to access the garbage
> > values under the mask, which are often clobbered values. The only
> way to
> > "remove" a masked value is by replacing it with a new non-masked
> value.
> >
> >
> > > Anyway, it would seem easily at the point where I should comment
> > on your
> > > repository rather than in the mailing list!
> >
> > To make further progress on this encapsulation idea I need a more
> > complete ducktype to pass into MaskedArray to test, so that's what
> I'll
> > work on next, when I have time. I'll either try to finish my
> > ArrayCollection type, or try making a simple NDunit ducktype
> > piggybacking on astropy's Unit.
> >
> > Best,
> > Allan
> >
> >
> > >
> > > All the best,
> > >
> > > Marten
> > >
> > >
> > > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
> > <allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>
> > > <mailto:allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>>>
> > wrote:
> > >
> > > On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
> > > >
> > > >
> > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
> > > <allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>
> > <mailto:allanhaldane at gmail.com <mailto:allanhaldane at gmail.com>>
> > > > <mailto:allanhaldane at gmail.com
> > <mailto:allanhaldane at gmail.com> <mailto:allanhaldane at gmail.com
> > <mailto:allanhaldane at gmail.com>>>>
> > > wrote:
> > > > <snip>
> > > >
> > > > > This may be too much to ask from the initializer, but,
> if
> > > so, it still
> > > > > seems most useful if it is made as easy as possible to
> do,
> > > say, `class
> > > > > MaskedQuantity(Masked, Quantity): <very few
> overrides>`.
> > > >
> > > > Currently MaskedArray does not accept ducktypes as
> > underlying
> > > arrays,
> > > > but I think it shouldn't be too hard to modify it to do
> so.
> > > Good idea!
> > > >
> > > >
> > > > Looking back at my trial, I see that I also never got to
> > duck arrays -
> > > > only ndarray subclasses - though I tried to make the code as
> > > agnostic as
> > > > possible.
> > > >
> > > > (Trial at
> > > >
> > >
> >
> https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1
> )
> > > >
> > > > I already partly navigated this mixin-issue in the
> > > > "MaskedArrayCollection" class, which essentially does
> > > > ArrayCollection(MaskedArray(array)), and only takes
> about 30
> > > lines of
> > > > boilerplate. That's the backwards encapsulation order
> from
> > > what you want
> > > > though.
> > > >
> > > >
> > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) *
> > u.m,
> > > > mask=[True, False, False])` does indeed not have a `.unit`
> > attribute
> > > > (and cannot represent itself...); I'm not at all sure that my
> > > method of
> > > > just creating a mixed class is anything but a recipe for
> > disaster,
> > > though!
> > >
> > > Based on your suggestion I worked on this a little today, and
> > now my
> > > MaskedArray more easily encapsulates both ducktypes and ndarray
> > > subclasses (pushed to repo). Here's an example I got working
> > with masked
> > > units using unyt:
> > >
> > > [1]: from MaskedArray import X, MaskedArray, MaskedScalar
> > >
> > > [2]: from unyt import m, km
> > >
> > > [3]: import numpy as np
> > >
> > > [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
> > >
> > > [5]: uarr
> > >
> > > MaskedArray([1., X , 3.])
> > > [6]: uarr + 1*m
> > >
> > > MaskedArray([1.001, X , 3.001])
> > > [7]: uarr.filled()
> > >
> > > unyt_array([1., 0., 3.], 'km')
> > > [8]: np.concatenate([uarr, 2*uarr]).filled()
> > > unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
> > >
> > > The catch is the ducktype/subclass has to rigorously follow
> > numpy's
> > > indexing rules, including distinguishing 0d arrays from
> > scalars. For now
> > > only I used unyt in the example above since it happens to be
> > less strict
> > > about dimensionless operations than astropy.units which trips
> > up my
> > > repr code. (see below for example with astropy.units). Note in
> > the last
> > > line I lost the dimensions, but that is because unyt does not
> > handle
> > > np.concatenate. To get that to work we need a true ducktype
> > for units.
> > >
> > > The example above doesn't expose the ".units" attribute
> > outside the
> > > MaskedArray, and it doesn't print the units in the repr. But
> > you can
> > > access them using "filled".
> > >
> > > While I could make MaskedArray forward unknown attribute
> > accesses to the
> > > encapsulated array, that seems a bit dangerous/bug-prone at
> first
> > > glance, so probably I want to require the user to make a
> > MaskedArray
> > > subclass to do so. I've just started playing with that
> > (probably buggy),
> > > and Ive attached subclass examples for astropy.unit and unyt,
> > with some
> > > example output below.
> > >
> > > Cheers,
> > > Allan
> > >
> > >
> > >
> > > Example using the attached astropy unit subclass:
> > >
> > > >>> from astropy.units import m, km, s
> > > >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0])
> > > >>> uarr
> > > MaskedQ([1., X , 1.], units=km)
> > > >>> uarr.units
> > > km
> > > >>> uarr + (1*m)
> > > MaskedQ([1.001, X , 1.001], units=km)
> > > >>> uarr/(1*s)
> > > MaskedQ([1., X , 1.], units=km / s)
> > > >>> (uarr*(1*m))[1:]
> > > MaskedQ([X , 1.], units=km m)
> > > >>> np.add.outer(uarr, uarr)
> > > MaskedQ([[2., X , 2.],
> > > [X , X , X ],
> > > [2., X , 2.]], units=km)
> > > >>> print(uarr)
> > > [1. X 1.] km m
> > >
> > > Cheers,
> > > Allan
> > >
> > >
> > > > > Even if this impossible, I think it is conceptually
> useful
> > > to think
> > > > > about what the masking class should do. My sense is
> that,
> > > e.g., it
> > > > > should not attempt to decide when an operation
> > succeeds or not,
> > > > but just
> > > > > "or together" input masks for regular, multiple-input
> > functions,
> > > > and let
> > > > > the underlying arrays skip elements for reductions by
> > using
> > > `where`
> > > > > (hey, I did implement that for a reason... ;-). In
> > > particular, it
> > > > > suggests one should not have things like domains and
> all
> > > that (I never
> > > > > understood why `MaskedArray` did that). If one wants
> more,
> > > the class
> > > > > should provide a method that updates the mask (a
> sensible
> > > default
> > > > might
> > > > > be `mask |= ~np.isfinite(result)` - here, the class
> > being masked
> > > > should
> > > > > logically support ufuncs and functions, so it can
> > decide what
> > > > "isfinite"
> > > > > means).
> > > >
> > > > I agree it would be nice to remove domains. It would
> > make life
> > > easier,
> > > > and I could remove a lot of twiddly code! I kept it in
> > for now to
> > > > minimize the behavior changes from the old MaskedArray.
> > > >
> > > >
> > > > That makes sense. Could be separated out to a
> > backwards-compatibility
> > > > class later.
> > > >
> > > >
> > > > > In any case, I would think that a basic truth should
> > be that
> > > > everything
> > > > > has a mask with a shape consistent with the data, so
> > > > > 1. Each complex numbers has just one mask, and setting
> > > `a.imag` with a
> > > > > masked array should definitely propagate the mask.
> > > > > 2. For a masked array with structured dtype, I'd
> > similarly say
> > > > that the
> > > > > default is for a mask to have the same shape as the
> array.
> > > But that
> > > > > something like your collection makes sense for the case
> > > where one
> > > > wants
> > > > > to mask items in a structure.
> > > >
> > > > Agreed that we should have a single bool per complex or
> > structured
> > > > element, and the mask shape is the same as the array
> shape.
> > > That's how I
> > > > implemented it. But there is still a problem with
> > complex.imag
> > > > assignment:
> > > >
> > > > >>> a = MaskedArray([1j, 2, X])
> > > > >>> i = a.imag
> > > > >>> i[:] = MaskedArray([1, X, 1])
> > > >
> > > > If we make the last line copy the mask to the original
> > array, what
> > > > should the real part of a[2] be? Conversely, if we don't
> > copy
> > > the mask,
> > > > what should the imag part of a[1] be? It seems like we
> might
> > > "want" the
> > > > masks to be OR'd instead, but then should i[2] be masked
> > after
> > > we just
> > > > set it to 1?
> > > >
> > > > Ah, I see the issue now... Easiest to implement and closest
> > in analogy
> > > > to a regular view would be to just let it unmask a[2] (with
> > > whatever is
> > > > in real; user beware!).
> > > >
> > > > Perhaps better would be to special-case such that `imag`
> > returns a
> > > > read-only view of the mask. Making `imag` itself read-only
> would
> > > prevent
> > > > possibly reasonable things like `i[np.isclose(i, 0)] = 0` -
> but
> > > there is
> > > > no reason this should update the mask.
> > > >
> > > > Still, neither is really satisfactory...
> > > >
> > > >
> > > >
> > > > > p.s. I started trying to implement the above "Mixin"
> > class; will
> > > > try to
> > > > > clean that up a bit so that at least it uses `where`
> and
> > > push it up.
> > > >
> > > > I played with "where", but didn't include it since 1.17
> > is not
> > > released.
> > > > To avoid duplication of effort, I've attached a diff of
> > what I
> > > tried. I
> > > > actually get a slight slowdown of about 10% by using
> > where...
> > > >
> > > >
> > > > Your implementation is indeed quite similar to what I got in
> > > > __array_ufunc__ (though one should "&" the where with ~mask).
> > > >
> > > > I think the main benefit is not to presume that whatever is
> > underneath
> > > > understands 0 or 1, i.e., avoid filling.
> > > >
> > > >
> > > > If you make progress with the mixin, a push is welcome. I
> > > imagine a
> > > > problem is going to be that np.isscalar doesn't work to
> > detect
> > > duck
> > > > scalars.
> > > >
> > > > I fear that in my attempts I've simply decided that only
> > array scalars
> > > > exist...
> > > >
> > > > -- Marten
> > > >
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > NumPy-Discussion at python.org
> > <mailto:NumPy-Discussion at python.org>
> > <mailto:NumPy-Discussion at python.org
> > <mailto:NumPy-Discussion at python.org>>
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > >
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at python.org
> > <mailto:NumPy-Discussion at python.org>
> > <mailto:NumPy-Discussion at python.org
> > <mailto:NumPy-Discussion at python.org>>
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > >
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at python.org <mailto:NumPy-Discussion at python.org>
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org <mailto:NumPy-Discussion at python.org>
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20190622/09795243/attachment-0001.html>
More information about the NumPy-Discussion
mailing list