[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Thu Jun 23 20:56:12 EDT 2011

On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM <pgmdevlist at gmail.com> wrote:

> Sorry y'all, I'm just commenting bits by bits:
>
> "One key problem is a lack of orthogonality with other features, for
> instance creating a masked array with physical quantities can't be done
> because both are separate subclasses of ndarray. The only reasonable way to
> deal with this is to move the mask into the core ndarray."
>
> Meh. I did try to make it easy to use masked arrays on top of subclasses.
> There's even some tests in the suite to that effect (test_subclassing). I'm
> not buying the argument.
> About moving mask in the core ndarray: I had suggested back in the days to
> have a mask flag/property built-in ndarrays (which would *really* have
> simplified the game), but this suggestion was dismissed very quickly as
> adding too much overload. I had to agree. I'm just a tad surprised the wind
> has changed on that matter.
>
>
> "In the current masked array, calculations are done for the whole array,
> then masks are patched up afterwards. This means that invalid calculations
> sitting in masked elements can raise warnings or exceptions even though they
> shouldn't, so the ufunc error handling mechanism can't be relied on."
>
> Well, there's a reason for that. Initially, I tried to guess what the mask
> of the output should be from the mask of the inputs, the objective being to
> avoid getting NaNs in the C array. That was easy in most cases, but it
> turned out it wasn't always possible (the `power` one caused me a lot of
> issues, if I recall correctly). So, for performance issues (to avoid a lot
> of expensive tests), I fell back on the old concept of "compute them all,
> they'll be sorted afterwards".
> Of course, that's rather clumsy an approach. But it works not too badly
> when in pure Python. No doubt that a proper C implementation would work
> faster.
> Oh, about using NaNs for invalid data ? Well, can't work with integers.
>
> `mask` property:
> Nothing to add to it. It's basically what we have now (except for the
> opposite convention).
>
> Working with masked values:
> I recall some strong points back in the days for not using None to
> represent missing values...
> Adding a maskedstr argument to array2string ? Mmh... I prefer a global flag
> like we have now.
>
> Design questions:
> Adding `masked` or whatever we call it to a number/array should result in
> masked/a fully masked array, period. That way, we can have an idea that
> something was wrong with the initial dataset.
> hardmask: I never used the feature myself. I wonder if anyone did. Still,
> it's a nice idea...
>

As a heavy masked_array user, I regret not being able to participate more in
this discussion as I am madly cranking out matplotlib code.  I would like to
say that I have always seen masked arrays as being the "next step up" from
using arrays with NaNs.  The hardmask/softmask/sharedmask concepts are
powerful, and I don't think they have yet been exploited to their fullest
potential.
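
For anyone who hasn't played with these, here is a minimal sketch of the hard/soft distinction as numpy.ma exposes it today.  The variable names are only for illustration, and it covers just the simplest case (scalar assignment into a masked slot).

```python
import numpy as np

# Soft mask (the default): assigning to a masked element unmasks it.
soft = np.ma.array([1, 2, 3], mask=[False, True, False])
soft[1] = 10
soft_mask = np.ma.getmaskarray(soft)   # no element is masked any more

# Hard mask: assignment cannot unmask an element.
hard = np.ma.array([1, 2, 3], mask=[False, True, False], hard_mask=True)
hard[1] = 10
hard_mask = np.ma.getmaskarray(hard)   # the middle element stays masked
```

With the soft mask the write goes through and the element becomes visible; with the hard mask the element stays hidden no matter what is assigned to it.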

Masks are (relatively) easy when dealing with element-by-element operations
that produce an array of the same shape (or at least the same number of
elements in the case of reshape and transpose).  What gets difficult is
reductions such as sum or max, etc.  Then you get into the weirder cases
such as unwrap and gradients that I brought up recently.  I am not sure how
to address this, but I am not a fan of the idea of adding yet another
parameter to the ufuncs to determine what to do for filling in a mask.
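
To make the easy and the harder cases concrete, here is a small sketch of what the existing numpy.ma does for an element-by-element operation and for a reduction.  It only shows current behaviour with made-up values; it says nothing about how a core implementation would or should behave.

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
b = np.ma.array([10.0, 20.0, 30.0], mask=[False, False, True])

# Element-by-element: the output is masked wherever either input is masked.
elementwise_mask = np.ma.getmaskarray(a + b)

# Reductions skip masked elements...
total = a.sum()      # adds only the visible 1.0 and 3.0
largest = a.max()    # largest visible value

# ...and reducing nothing but masked elements yields the masked constant.
all_hidden = np.ma.array([1.0, 2.0], mask=[True, True])
empty_total = all_hidden.sum()
```

The element-by-element rule is obvious (union of the input masks).  For the reduction there is already a policy decision baked in: skip what is hidden, and return `masked` when nothing is left.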

Also, just to make things messier, there is an incomplete feature that was
made for record arrays with regards to masking.  The idea was to allow for
element-by-element masking, but also allow for row-by-row (or was it
column-by-column?) masking.  I thought it was a neat feature, and it is too
bad that it was not finished.

Anyway, my opinion is that a mask should be True for a value that needs to
be hidden.  Do not change this convention.  People coming to Python
already have to change their code, so a simple bit flip should be fine for
them.  Breaking existing Python code is worse.
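
The "bit flip" I have in mind is nothing more than the following sketch, where `valid` is a hypothetical array of flags that uses the opposite convention (True means the value is usable):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
valid = np.array([True, False, True])   # hypothetical "True means usable" flags

# numpy.ma expects True to mean "hidden", so invert once on the way in.
hidden = ~valid
m = np.ma.array(data, mask=hidden)
visible = m.compressed()    # only the unmasked values
```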

I also don't see it as entirely necessary for *all* of masked arrays to be
brought into numpy core.  Only the most important parts/hooks need to be.
We could then still have a masked array class that provides the finishing
touches such as the sharing of masks and special masked related functions.

Lastly, I am not entirely familiar with R, so I am also very curious about
what this magical "NA" value is, and how it compares to how NaNs work.
Although, Pierre brought up the very good point that NaNs wouldn't work
anyway with integer arrays (and object arrays, etc.).
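
A quick sketch of that point, using made-up values: an integer array simply refuses a NaN, while a mask hides the element without touching the dtype.

```python
import numpy as np

counts = np.array([4, 7, 9])

# An integer array has no slot for NaN, so this assignment fails.
try:
    counts[1] = np.nan
    nan_accepted = True
except ValueError:
    nan_accepted = False

# A mask hides the same element and leaves the integer dtype alone.
masked_counts = np.ma.array(counts, mask=[False, True, False])
kept_dtype = masked_counts.dtype
visible_total = masked_counts.sum()   # adds only the visible 4 and 9
```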

Back to toiling on matplotlib,
Ben Root