[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 20:39:41 EDT 2011

On Fri, Jun 24, 2011 at 4:09 PM, Benjamin Root <ben.root at ou.edu> wrote:

>
>
> On Fri, Jun 24, 2011 at 10:40 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>
>> On Thu, Jun 23, 2011 at 7:56 PM, Benjamin Root <ben.root at ou.edu> wrote:
>>
>>> On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
>>>
>>>> Sorry y'all, I'm just commenting bits by bits:
>>>>
>>>> "One key problem is a lack of orthogonality with other features, for
>>>> instance creating a masked array with physical quantities can't be done
>>>> because both are separate subclasses of ndarray. The only reasonable way to
>>>> deal with this is to move the mask into the core ndarray."
>>>>
>>>> Meh. I did try to make it easy to use masked arrays on top of
>>>> subclasses. There's even some tests in the suite to that effect
>>>> (test_subclassing). I'm not buying the argument.
>>>> About moving mask in the core ndarray: I had suggested back in the days
>>>> to have a mask flag/property built-in ndarrays (which would *really* have
>>>> simplified the game), but this suggestion was dismissed very quickly as
>>>> adding too much overload. I had to agree. I'm just a tad surprised the wind
>>>> has changed on that matter.
>>>>
>>>>
>>>> "In the current masked array, calculations are done for the whole array,
>>>> then masks are patched up afterwards. This means that invalid calculations
>>>> sitting in masked elements can raise warnings or exceptions even though they
>>>> shouldn't, so the ufunc error handling mechanism can't be relied on."
>>>>
>>>> Well, there's a reason for that. Initially, I tried to guess what the
>>>> mask of the output should be from the mask of the inputs, the objective
>>>> being to avoid getting NaNs in the C array. That was easy in most cases,
>>>> but it turned out it wasn't always possible (the `power` one caused me a
>>>> lot of issues, if I recall correctly). So, for performance issues (to avoid
>>>> a lot of expensive tests), I fell back on the old concept of "compute them
>>>> all, they'll be sorted afterwards".
>>>> Of course, that's rather clumsy an approach. But it works not too badly
>>>> when in pure Python. No doubt that a proper C implementation would work
>>>> faster.
>>>> Oh, about using NaNs for invalid data ? Well, can't work with integers.
>>>>
>>>> `mask` property:
>>>> Nothing to add to it. It's basically what we have now (except for the
>>>> opposite convention).
>>>>
>>>> Working with masked values:
>>>> I recall some strong points back in the days for not using None to
>>>> represent missing values...
>>>> Adding a maskedstr argument to array2string ? Mmh... I prefer a global
>>>> flag like we have now.
>>>>
>>>> Design questions:
>>>> Adding `masked` or whatever we call it to a number/array should result
>>>> in masked/a fully masked array, period. That way, we can have an idea that
>>>> something was wrong with the initial dataset.
>>>> hardmask: I never used the feature myself. I wonder if anyone did.
>>>> Still, it's a nice idea...
>>>>
>>>
>>> As a heavy masked_array user, I regret not being able to participate more
>>> in this discussion as I am madly cranking out matplotlib code.  I would like
>>> to say that I have always seen masked arrays as being the "next step up"
>>> from using arrays with NaNs.  The hardmask/softmask/sharedmask concepts
>>> are powerful, and I don't think they have yet been exploited to their
>>> fullest potential.
>>>
>>
>> Do you have some examples where hardmask or sharedmask are being used? I
>> like the idea of using a hardmask array as the return value for boolean
>> indexing, but some more use cases would be nice.
>>
>>
>
> At one point I did have something for soft/hard masks, but I think my final
> implementation went a different direction.  I would have to look around.  I
> do have a good use-case for soft masks.  For a given data, I wanted to
> produce several pcolors highlighting different regions.  A soft mask
> provided me a quick-n-easy way to change the mask without having to produce
> many copies of the original data.
>

That sounds cool, matplotlib will be a good place to do test modifications
while I'm doing the implementation.
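
A minimal sketch of the soft-mask use case described above, written against
the existing numpy.ma module rather than the proposed core-ndarray design.
The array contents and the region names ("low", "mid", "high") are made up
for illustration. One masked view is created over the data without copying
it, and only the mask is swapped between regions. The hard-mask case at the
end shows why a soft mask is needed here: a hard mask can only grow.

```python
import numpy as np
import numpy.ma as ma

data = np.arange(12, dtype=float).reshape(3, 4)

# One masked view over the original buffer; masks are soft by default.
view = ma.masked_array(data, copy=False)

# Hypothetical regions to highlight, one plot per region.
regions = {
    "low": data < 4,
    "mid": (data >= 4) & (data < 8),
    "high": data >= 8,
}

visible_counts = {}
for name, region in regions.items():
    view.mask = ~region            # hide everything outside the region
    visible_counts[name] = int(view.count())
    # a call such as pcolor(view) would go here

# With a hard mask, assigning a new mask can only add masked elements,
# so switching from "low" to "high" leaves nothing visible.
hard = ma.masked_array(data, mask=~regions["low"], copy=False, hard_mask=True)
hard.mask = ~regions["high"]
hard_visible = int(hard.count())
```

The data buffer is shared throughout; only the boolean mask changes between
iterations.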

>>> Masks are (relatively) easy when dealing with element-by-element operations
>>> that produce an array of the same shape (or at least the same number of
>>> elements in the case of reshape and transpose).  What gets difficult is for
>>> reductions such as sum or max, etc.  Then you get into the weirder cases
>>> such as unwrap and gradients that I brought up recently.  I am not sure how
>>> to address this, but I am not a fan of the idea of adding yet another
>>> parameter to the ufuncs to determine what to do for filling in a mask.
>>>
>>
>> It looks like in R there is a parameter called na.rm=T/F, which basically
>> means "remove NAs before doing the computation". This approach seems good to
>> me for reduction operations.
>>
>>
> Just to throw out some examples where these settings really do not make
> much sense.  For gradients and unwrap, maybe you want to skip na's, but
> still record the number of points you are skipping or maybe the points at
> na-boundaries become na's themselves.  Are we going to have something for
> each one of these possibilities?  Of course, this isn't even very well dealt
> with in masked arrays right now.
>

Yeah, for some functions dealing with NA values will need individual
per-function care. Probably they should raise by default until NA support is
implemented for them.
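
For comparison, here is roughly how the "skip or propagate" choice for
reductions looks with what NumPy already offers. This is only an analogy to
R's na.rm, not the NEP's API: NaN stands in for NA in the first half, and a
numpy.ma mask in the second. The sample values are made up.

```python
import numpy as np
import numpy.ma as ma

values = np.array([1.0, np.nan, 3.0, 4.0])

# Propagate: one missing value poisons the result (like na.rm=FALSE).
propagated = np.sum(values)

# Skip: missing values are dropped before reducing (like na.rm=TRUE).
skipped = np.nansum(values)

# Masked arrays always skip masked elements in reductions.
masked = ma.masked_invalid(values)
masked_total = masked.sum()
masked_mean = masked.mean()        # mean over the three valid elements
```

Functions such as gradient or unwrap do not fit this two-way choice, which
is the per-function care mentioned above.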

> Another example of how we use masks in matplotlib is in pcolor().  We have
> to combine the possible masks of X, Y, and V in both the x and y directions
> to find the final mask to use for the final output result (because each
> facet needs valid data at each corner).  Having a soft-mask implementation
> allows one to create a temporary mask to use for the operation, and to share
> that mask across all the input data, but then let the data structures retain
> their original masks when done.
>

I will look at the implementation.
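
A simplified sketch of the mask combination Ben describes, assuming corner
coordinates X and Y of shape (rows+1, cols+1) and cell values C of shape
(rows, cols). This is not matplotlib's actual pcolor() code; the grid size
and the masked positions are invented for illustration.

```python
import numpy as np
import numpy.ma as ma

# 3x4 grid of corners, hence 2x3 cells.
X, Y = np.meshgrid(np.arange(4.0), np.arange(3.0))
X = ma.masked_array(X)
Y = ma.masked_array(Y)
X[0, 0] = ma.masked                 # one corner lacks a valid x coordinate

C = ma.masked_array(np.ones((2, 3)))
C[0, 2] = ma.masked                 # one cell lacks a valid value

# A corner is unusable if either of its coordinates is masked.
corner_bad = ma.getmaskarray(X) | ma.getmaskarray(Y)

# A cell is unusable if any of its four corners is, or if its value is masked.
cell_bad = (corner_bad[:-1, :-1] | corner_bad[1:, :-1] |
            corner_bad[:-1, 1:] | corner_bad[1:, 1:] |
            ma.getmaskarray(C))

# Temporary masked view used only for drawing; C keeps its own mask.
C_draw = ma.masked_array(C.data, mask=cell_bad, copy=False)
```

C_draw shares its data with C but carries the combined mask, so the inputs
retain their original masks once the drawing is done.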

>>> Also, just to make things messier, there is an incomplete feature that was
>>> made for record arrays with regards to masking.  The idea was to allow for
>>> element-by-element masking, but also allow for row-by-row (or was it
>>> column-by-column?) masking.  I thought it was a neat feature, and it is too
>>> bad that it was not finished.
>>>
>>
>> I put this in my design, I think this would be useful too. I would call it
>> field by field, though many people like thinking of the struct dtype fields
>> as columns.
>>
>>
> Fields are fine.  I have found that there is no real consistency with how
> professionals refer to their rows and columns as "records" and "fields".  I
> learned data-handling from working on databases, but my naming convention
> often clashes with some of my committee members who come from a stats
> background.
>

I prefer considering them like C structs, which is why I've started calling
them "struct dtypes". That name is also shorter than "structured dtypes".

>>> Anyway, my opinion is that a mask should be True for a value that needs
>>> to be hidden.  Do not change this convention.  People coming into python
>>> already have to change code, a simple bit flip for them should be fine.
>>> Breaking existing python code is worse.
>>>
>>
>> I'm now thinking the mask needs to be pushed away into the background to
>> where it becomes an unimportant implementation detail of the system. It
>> deserves a long cumbersome name like "validitymask", and then the system can
>> use something close to R's approach with an NA-like singleton for most
>> operations.
>>
>
> Don't lose sight that we are really talking about two orthogonal (albeit,
> seemingly similar) concepts.  "missing" data and "ambiguous" data.  Both of
> these tools need to be at the forefront and the distinction needs to be made
> clear to the users so that they know which one they need in what situation.
> I think hiding masks is a bad idea.  I want numpy to be *better* than R by
> offering both features in a clear, non-conflicting manner.
>

That sounds good to me, we'll have to go through several design iterations
to shake out the details.

> On a note somewhat similar to what I was pointing out earlier with regards to
> soft masks.  One thing that is very nice about masked_arrays is that I can
> at any time turn a regular numpy array into a masked array without paying a
> penalty of having to re-assign the data.  Just need to make a separate mask
> object.
>

I believe my design sufficiently allows for this.
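
For reference, this is how the no-copy wrapping works with numpy.ma today
(how the NEP would spell it is not shown here). The sentinel value -1 and
the sample data are assumptions for illustration. Integer data is used
because NaN is not available for it.

```python
import numpy as np
import numpy.ma as ma

data = np.array([3, -1, 7, -1, 5])          # -1 marks a missing reading

# Wrap the existing buffer; only the boolean mask is newly allocated.
wrapped = ma.masked_array(data, mask=(data == -1), copy=False)

same_buffer = np.shares_memory(wrapped.data, data)
total_before = int(wrapped.sum())           # sentinel entries are skipped

# Writes to the original array are visible through the masked view.
data[0] = 10
total_after = int(wrapped.sum())
```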

> This is different from how one would operate with a na-dtype approach, where
> converting an array with a regular dtype into a na-dtype array would require
> a copy.  However, with proper dtype-handling, this may not be of much
> concern (non-na-dtype + na-dtype --> na-dtype, much like how int + float -->
> float).  Also loading functions could be told to cast to a na-dtype, which
> would then result in an array that is ready "out-of-the-box" as opposed to
> casting the masked array after the creation of the regular ndarray from a
> function like np.loadtxt().
>

The syntax for casting each element of a struct dtype to a new struct dtype
with all na-dtypes would be clumsy at first, there's a bunch of things that
would have to be figured out to make that all play nicely.
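
Two existing behaviours that the na-dtype discussion above is being compared
to, sketched with current NumPy. No na-dtype exists in this code: the first
part shows the int + float promotion used as the analogy, and the second
shows a loader that returns a masked array directly (np.genfromtxt with
usemask=True) instead of masking after the fact. The CSV text is made up.

```python
import io
import numpy as np

# Promotion analogy: mixing int and float yields float.
promoted = np.result_type(np.int64, np.float64)
mixed = np.array([1, 2], dtype=np.int64) + np.array([0.5, 0.5])

# A loader that marks missing fields while reading; the empty field in
# the second row becomes a masked element.
text = io.StringIO("1,2,3\n4,,6\n")
loaded = np.genfromtxt(text, delimiter=",", usemask=True)
```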

> Again, there are pros and cons either way and I see them as very orthogonal
> and complementary.  Heck, I could even imagine situations where one might
> want a mask over an array with a na-dtype.
>

Maybe, but I'm kind of liking the idea of both of these use cases being
handled by the same underlying mechanism. I've updated the NEP, and will let
it bake for a bit I think.

-Mark

>
> Ben Root
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>