[Numpy-discussion] new MaskedArray class

Stephan Hoyer shoyer at gmail.com
Mon Jun 24 21:03:38 EDT 2019


On Mon, Jun 24, 2019 at 5:36 PM Marten van Kerkwijk <
m.h.vankerkwijk at gmail.com> wrote:

>
>
> On Mon, Jun 24, 2019 at 7:21 PM Stephan Hoyer <shoyer at gmail.com> wrote:
>
>> On Mon, Jun 24, 2019 at 3:56 PM Allan Haldane <allanhaldane at gmail.com>
>> wrote:
>>
>>> I'm not at all set on that behavior and we can do something else. For
>>> now, I chose this way since it seemed to best match the "IGNORE" mask
>>> behavior.
>>>
>>> The behavior you described further above, where the output row/col would
>>> be masked, corresponds better to "NA" (propagating) mask behavior, which
>>> I am leaving for later implementation.
>>
>>
>> This does seem like a clean way to *implement* things, but from a user
>> perspective I'm not sure I would want separate classes for "IGNORE" vs "NA"
>> masks.
>>
>> I tend to think of "IGNORE" vs "NA" as descriptions of particular
>> operations rather than of the data itself. There is a spectrum of ways to
>> handle missing data, and the right way to propagate missing values is
>> often highly context dependent. The right place to set this is in the
>> functions where operations are defined, not on classes that may be defined
>> far away from where the computation happens. For example, pandas has a
>> "min_count" parameter in functions for intermediate use-cases between
>> "IGNORE" and "NA" semantics, e.g., "take an average, unless the number of
>> data points is fewer than min_count."
>>
>
> Anything as specific as that is probably indeed outside the purview
> of a MaskedArray class.
>

I agree that it doesn't make much sense to have a "min_count" attribute on
a MaskedArray class, but certainly it makes sense for operations on
MaskedArray objects, e.g., to write something like
masked_array.mean(min_count=10). This is what users do in pandas today.
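
To make that concrete, here is a minimal sketch of what a min_count-aware
reduction could look like (the `masked_mean` helper and its `min_count`
parameter are hypothetical, borrowed from the pandas idea, not an existing
NumPy API):

    import numpy as np

    def masked_mean(data, mask, min_count=1):
        # Mean over unmasked elements; NaN if fewer than min_count remain.
        count = np.count_nonzero(~mask)
        if count < min_count:
            return np.nan
        return np.add.reduce(data, where=~mask) / count

    data = np.array([1.0, 2.0, 3.0, 4.0])
    mask = np.array([False, True, False, False])
    masked_mean(data, mask)               # 2.666..., i.e. (1 + 3 + 4) / 3
    masked_mean(data, mask, min_count=4)  # nan: only 3 unmasked values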


> But your general point is well taken: we really need to ask clearly what
> the mask means, not in terms of operations but conceptually.
>
> Personally, I guess like Benjamin I have mostly thought of it as "data
> here is bad" (because corrupted, etc.) or "data here is irrelevant"
> (because of sea instead of land in a map). And I would like to proceed
> nevertheless with calculating things on the remainder. For an expectation
> value (or, less obviously, a minimum or maximum), this is mostly OK: just
> ignore the masked elements. But even for something as simple as a sum, what
> is correct is not obvious: if I ignore the count, I'm effectively assuming
> the expectation is symmetric around zero (this is why `vector.dot(vector)`
> fails); a better estimate would be `np.add.reduce(data, where=~mask) *
> N(total) / N(unmasked)`.
>

I think it's fine and logical to define default semantics for operations on
MaskedArray objects. Much of the time, replacing masked values with 0 is
the right thing to do for sum. Certainly IGNORE semantics are more useful
overall than NA semantics.
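
For concreteness, the two behaviors under discussion look like this with
plain arrays standing in for a MaskedArray's data and mask (a sketch, not
the proposed API; the rescaled estimate is the one suggested above):

    import numpy as np

    data = np.array([2.0, 4.0, 6.0, 8.0])
    mask = np.array([False, True, False, False])  # True = masked

    # IGNORE-style sum: masked values contribute nothing (i.e., 0).
    ignore_sum = np.add.reduce(data, where=~mask)  # 16.0

    # Rescaled estimate: sum(unmasked) * N(total) / N(unmasked)
    estimate = ignore_sum * data.size / np.count_nonzero(~mask)  # ~21.33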

But even if a MaskedArray conceptually always represents "bad" or
"irrelevant" data, the way to handle those missing values will differ based
on the use case, and not everything will fall cleanly into either IGNORE or
NA buckets. I think it makes sense to provide users with functions/methods
that expose these options, rather than requiring that they convert their
data into a different type of MaskedArray.
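
As a rough illustration of what that could look like (the `mask_behavior`
keyword is purely hypothetical, just to show per-operation rather than
per-class semantics):

    import numpy as np

    def masked_sum(data, mask, mask_behavior="ignore"):
        # Choose missing-value semantics per call, not per class.
        if mask_behavior == "ignore":
            return np.add.reduce(data, where=~mask)
        if mask_behavior == "propagate":
            # NA-style: any masked value poisons the result.
            return np.nan if mask.any() else np.add.reduce(data)
        raise ValueError(f"unknown mask_behavior: {mask_behavior!r}")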

"It is better to have 100 functions operate on one data structure than 10
functions on 10 data structures." —Alan Perlis
https://stackoverflow.com/questions/6016271/why-is-it-better-to-have-100-functions-operate-on-one-data-structure-than-10-fun