[Numpy-discussion] missing data discussion round 2

Tue Jun 28 15:41:06 EDT 2011

On 06/28/2011 07:26 AM, Nathaniel Smith wrote:
> On Tue, Jun 28, 2011 at 9:38 AM, Charles R Harris
> <charlesr.harris at gmail.com>  wrote:
>> Nathaniel, an implementation using masks will look *exactly* like an
>> implementation using na-dtypes from the user's point of view. Except that
>> taking a masked view of an unmasked array allows ignoring values without
>> destroying or copying the original data.
>
> Charles, I know that :-).
>
> But if that view thing is an advertised feature -- in fact, the key
> selling point for the masking-based implementation, included
> specifically to make a significant contingent of users happy -- then
> it's certainly user-visible. And it will make other users unhappy,
> like I said. That's life.
>
> But who cares? My main point is that implementing a missing data
> solution and a separate masked array solution is probably less work
> than implementing a single everything-to-everybody solution *anyway*,
> *and* it might make both sets of users happier too. Notice that in my
> proposal, there's really nothing there that isn't already in Mark's
> NEP in some form or another, but in my version there's almost no
> overlap between the two features. That's not because I was trying to
> make them artificially different; it's because I tried to think of the
> most natural ways to satisfy each set of use cases, and they're just
> different.

I think you are exaggerating some of the differences associated with the 
implementation, and ignoring one *key* difference: for integer types, 
the masked implementation can handle the full numeric range of the type, 
while the bit-pattern approach cannot.
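
To make the integer-range point concrete, here is a minimal sketch using the existing numpy.ma module. The choice of -128 as the reserved bit pattern is only an assumption for illustration; it is not taken from the NEP or from Nathaniel's proposal.

```python
import numpy as np
import numpy.ma as ma

# Every int8 bit pattern, including -128 and 127, is a legitimate value here.
data = np.array([-128, 0, 127, 5], dtype=np.int8)

# Mask approach: a separate boolean array marks the last entry as missing,
# so no value of the dtype has to be given up.
masked = ma.masked_array(data, mask=[False, False, False, True])

# Bit-pattern approach (hypothetical convention): reserve -128 as "missing".
SENTINEL = np.int8(-128)
sentinel_missing = data == SENTINEL

print(masked.min())         # the genuine -128 is still usable under the mask
print(sentinel_missing[0])  # the same -128 is read as "missing" by the sentinel
```

With the sentinel convention the first element can no longer be stored as data, which is exactly the range restriction described above.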

Balanced against that, the *key* advantages of the bit-pattern approach
would seem to be the simplicity of using a single array, particularly
for IO (including memmapping) and interfacing with extension code.
Although I am a heavy user of masked arrays, I consider these
bit-pattern advantages to be substantial and deserving of careful
consideration, perhaps with more weight and planning than they have
received so far.

Datasets on disk (e.g. climatological data or numerical model output)
typically do use reserved values as missing-value flags, although
occasionally one also finds separate mask arrays.
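
As a rough illustration of how such on-disk flags map onto the current masked array, the sketch below assumes a made-up flag value of -9999.0 and a small in-memory array standing in for data read from a file.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical missing-value flag, as a dataset's metadata might declare it.
FILL = -9999.0

# Stand-in for values read from disk; two entries carry the flag.
raw = np.array([12.5, FILL, 14.1, FILL, 13.0])

# Turn the reserved value into a mask; masked_values compares floats
# approximately and records FILL as the array's fill_value.
temps = ma.masked_values(raw, FILL)

# Reductions now skip the flagged entries.
print(temps.mean())

# Going back to the single-array, reserved-value form for writing out.
restored = temps.filled(FILL)
```

The round trip through filled() is what makes the single-array form convenient for IO, at the cost of giving up one value of the type.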

One of the real frustrations of the present masked array is that
savez/load have no support for it.  I could roll my own by using a
convention like saving the mask of xxx as xxx__mask__, and then
reversing the process in a modified load; but I haven't gotten around
to doing it.  Regardless of internal implementation, I hope that core
support for missing values will be included in savez/load.
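
For what it's worth, the xxx__mask__ convention could be sketched roughly as follows. The helper names savez_masked and load_masked are hypothetical, not part of numpy, and this version does not try to preserve fill_value.

```python
import io

import numpy as np
import numpy.ma as ma

MASK_SUFFIX = "__mask__"  # naming convention from the paragraph above


def savez_masked(file, **arrays):
    """Save arrays with np.savez, storing each mask as <name>__mask__."""
    out = {}
    for name, arr in arrays.items():
        if isinstance(arr, ma.MaskedArray):
            out[name] = ma.getdata(arr)
            # getmaskarray always returns a full boolean array, even for nomask.
            out[name + MASK_SUFFIX] = ma.getmaskarray(arr)
        else:
            out[name] = np.asarray(arr)
    np.savez(file, **out)


def load_masked(file):
    """Load an .npz file, re-attaching any <name>__mask__ entries."""
    result = {}
    with np.load(file) as npz:
        names = npz.files
        for name in names:
            if name.endswith(MASK_SUFFIX):
                continue  # consumed together with its data array
            mask_name = name + MASK_SUFFIX
            if mask_name in names:
                result[name] = ma.masked_array(npz[name], mask=npz[mask_name])
            else:
                result[name] = npz[name]
    return result


# Round trip through an in-memory buffer instead of a real file.
buf = io.BytesIO()
xxx = ma.masked_array([1, 2, 3], mask=[False, True, False])
savez_masked(buf, xxx=xxx, plain=np.arange(3))
buf.seek(0)
loaded = load_masked(buf)
print(loaded["xxx"])
```

A plain np.load on the same file still works and simply shows the extra xxx__mask__ entry, so files written this way remain readable without the helper.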

Eric

>
> -- Nathaniel