[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Eric Firing efiring at hawaii.edu
Thu Jun 23 17:54:45 EDT 2011


On 06/23/2011 11:19 AM, Nathaniel Smith wrote:
> I'd like to see a statement of what the "missing data problem" is, and
> how this solves it? Because I don't think this is entirely intuitive,
> or that everyone necessarily has the same idea.
>
>> Reduction operations like 'sum', 'prod', 'min', and 'max' will operate as if the values weren't there
>
> For context: My experience with missing data is in statistical
> analysis; I find R's NA support to be pretty awesome for those
> purposes. The conceptual model it's based on is that an NA value is
> some number that we just happen not to know. So from this perspective,
> I find it pretty confusing that adding an unknown quantity to 3 should
> result in 3, rather than another unknown quantity. (Obviously it
> should be possible to compute the sum of the known values, but IME
> it's important for the default behavior to be to fail loudly when
> things are wonky, not to silently patch them up, possibly
> incorrectly!)

From the oceanographic data acquisition and analysis perspective, and
perhaps from a more general plotting perspective (matplotlib,
specifically), missing data is simply missing; we don't have it, we never
will, but we need to do the best calculation (or plot) we can with what 
is left.  For plotting, that generally means showing a gap in a line, a 
hole in a contour plot, etc.  For calculations like basic statistics, it 
means doing the calculation, e.g. a mean, with the available numbers, 
*and* having an easy way to find out how many numbers were available. 
That's what the masked array count() method is for.
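
As a minimal sketch of that workflow with the existing numpy.ma module: the
sample values and the -999.0 "never recorded" sentinel below are made up for
illustration, not taken from any real dataset.

```python
import numpy as np

# Hypothetical temperature series; -999.0 is an assumed sentinel for
# samples that were never recorded.
raw = np.array([10.0, -999.0, 12.0, 14.0, -999.0])

# Mask every element equal to the sentinel.
temps = np.ma.masked_values(raw, -999.0)

# The mean is computed over the unmasked values only.
mean_available = temps.mean()

# count() reports how many unmasked values went into that mean.
n_available = temps.count()
```

Here the mean is taken over the three available samples, and count() gives
the caller what is needed to judge how much to trust it.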

Some types of calculations, like the FFT, simply can't be done by
ignoring missing values, so one must first fill the gaps by some method,
such as interpolation, and then pass an unmasked array to the function.
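
A rough sketch of the fill-then-transform step, assuming linear interpolation
is an acceptable fill for the data at hand (it often is not; the signal and
mask here are invented for illustration):

```python
import numpy as np

# Hypothetical evenly sampled signal with two missing samples.
t = np.arange(8, dtype=float)
signal = np.ma.masked_array(
    np.sin(2 * np.pi * t / 8),
    mask=[0, 0, 1, 0, 0, 0, 1, 0],
)

# Boolean array that is True where a sample is available.
good = ~np.ma.getmaskarray(signal)

# Fill the gaps by linear interpolation between the available samples.
# compressed() returns the unmasked values as a plain 1-D ndarray.
filled = np.interp(t, t[good], signal.compressed())

# 'filled' is an ordinary ndarray, so it can be handed to the FFT.
spectrum = np.fft.fft(filled)
```

The choice of fill is an analysis decision, which is why it seems reasonable
for it to stay an explicit step rather than something the FFT does silently.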

The present masked array module is very close to what is really needed 
for the sorts of things I am involved with.  It looks to me like the 
main deficiencies are addressed by Mark's proposal, although the change 
in the definition of the mask might make for a painful transition.

Eric

>
> Also, what should 'dot' do with missing values?
>
> -- Nathaniel
>
> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe<mwwiebe at gmail.com>  wrote:
>> Enthought has asked me to look into the "missing data" problem and how NumPy
>> could treat it better. I've considered the different ideas of adding dtype
>> variants with a special signal value and masked arrays, and concluded that
>> adding masks to the core ndarray appears to be the best way to deal with the
>> problem in general.
>> I've written a NEP that proposes a particular design, viewable here:
>> https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
>> There are some questions at the bottom of the NEP which definitely need
>> discussion to find the best design choices. Please read, and let me know of
>> all the errors and gaps you find in the document.
>> Thanks,
>> Mark
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>



