[Numpy-discussion] missing data discussion round 2

Wed Jun 29 14:49:11 EDT 2011

On 06/29/2011 01:07 PM, Dag Sverre Seljebotn wrote:
> On 06/29/2011 07:38 PM, Mark Wiebe wrote:
>> On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>
>>      On 06/29/2011 03:45 PM, Matthew Brett wrote:
>>       >  Hi,
>>       >
>>       >  On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>>       >>  On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>       >>>
>>       >>>  Hi,
>>       >>>
>>       >>>  On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>       >>>  ...
>>       >>>>  (You might think, what difference does it make if you *can*
>>      unmask an
>>       >>>>  item? Us missing data folks could just ignore this feature. But:
>>       >>>>  whatever we end up implementing is something that I will have to
>>       >>>>  explain over and over to different people, most of them not
>>       >>>>  particularly sophisticated programmers. And there's just no
>>      sensible
>>       >>>>  way to explain this idea that if you store some particular
>>      value, then
>>       >>>>  it replaces the old value, but if you store NA, then the old
>>      value is
>>       >>>>  still there.
>>       >>>
>>       >>>  Ouch - yes.  No question, that is difficult to explain.   Well, I
>>       >>>  think the explanation might go like this:
>>       >>>
>>       >>>  "Ah, yes, well, that's because in fact numpy records missing
>>      values by
>>       >>>  using a 'mask'.   So when you say `a[3] = np.NA`, what you mean is,
>>       >>>  `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
>>       >>>
>>       >>>  Is that fair?
>>       >>
>>       >>  My favorite way of explaining it would be to have a grid of
>>      numbers written
>>       >>  on paper, then have several cardboards with holes poked in them
>>      in different
>>       >>  configurations. Placing these cardboard masks in front of the
>>      grid would
>>       >>  show different sets of non-missing data, without affecting the
>>      values stored
>>       >>  on the paper behind them.
>>       >
>>       >  Right - but here of course you are trying to explain the mask, and
>>       >  this is Nathaniel's point, that in order to explain NAs, you have to
>>       >  explain masks, and so, even at a basic level, the fusion of the two
>>       >  ideas is obvious, and already confusing.  I mean this:
>>       >
>>       >  a[3] = np.NA
>>       >
>>       >  "Oh, so you just set the a[3] value to have some missing value code?"
>>       >
>>       >  "Ah - no - in fact what I did was set a associated mask in position
>>       >  a[3] so that you can't any longer see the previous value of a[3]"
>>       >
>>       >  "Huh.  You mean I have a mask for every single value in order to be
>>       >  able to blank out a[3]?  It looks like an assignment.  I mean, it
>>       >  looks just like a[3] = 4.  But I guess it isn't?"
>>       >
>>       >  "Er..."
>>       >
>>       >  I think Nathaniel's point is a very good one - these are separate
>>       >  ideas, np.NA and np.IGNORE, and a joint implementation is bound to
>>       >  draw them together in the mind of the user.    Apart from anything
>>       >  else, the user has to know that, if they want a single NA value in an
>>       >  array, they have to add a mask size array.shape in bytes.  They have
>>       >  to know then, that NA is implemented by masking, and then the 'NA for
>>       >  free by adding masking' idea breaks down and starts to feel like a
>>       >  kludge.
>>       >
>>       >  The counter argument is of course that, in time, the
>>      implementation of
>>       >  NA with masking will seem as obvious and intuitive, as, say,
>>       >  broadcasting, and that we are just reacting from lack of experience
>>       >  with the new API.
>>
>>      However, no matter how used we get to this, people coming from almost
>>      any other tool (in particular R) will keep think it is
>>      counter-intuitive. Why set up a major semantic incompatability that
>>      people then have to overcome in order to start using NumPy.
>>
>>
>> I'm not aware of a semantic incompatibility. I believe R doesn't support
>> views like NumPy does, so the things you have to do to see masking
>> semantics aren't even possible in R.
> Well, whether the same feature is possible or not in R is irrelevant to
> whether a semantic incompatability would exist.
>
> Views themselves are a *major* semantic incompatability, and are highly
> confusing at first to MATLAB/Fortran/R people. However they have major
> advantages outweighing the disadvantage of having to caution new users.
>
> But there's simply no precedence anywhere for an assignment that doesn't
> erase the old value for a particular input value, and the advantages
> seem pretty minor (well, I think it is ugly in its own right, but that
> is besides the point...)
>
> Dag Sverre
Depending on what you really mean by 'precedence', in most stats
software (R, SAS, etc.) it is completely up to the user to do this and
do it correctly. Usually you store the original data and create new
working data as needed, as the equivalent of either a masked array or a
view. Quite often I need the same variable as both a float and a string
to have the best of both worlds (otherwise it is a pain, not only going
back and forth but also ensuring the right type is being used at the
right time). The really huge advantage of masked arrays is that you
only need the original data plus the mask, and it is so easy to find
the 'flagged' observations.
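
As a rough illustration of "original data plus the mask", here is a
minimal sketch using the existing numpy.ma module (not the NA API under
discussion). The -999.0 missing-value code and the variable names are
assumptions made up for the example.

```python
import numpy as np
import numpy.ma as ma

# Original data, with -999.0 as a hypothetical missing-value code.
original = np.array([1.0, 2.0, -999.0, 4.0, -999.0])

# Working data: the same values plus a boolean mask (True = masked).
working = ma.masked_array(original, mask=(original == -999.0))

# Finding the 'flagged' observations is a lookup on the mask.
flagged = np.flatnonzero(ma.getmaskarray(working))

# Reductions skip the masked entries.
mean_before = working.mean()

# Masking one more element hides it without erasing the stored value.
working[0] = ma.masked
hidden_value = working.data[0]
```

Here `flagged` holds the positions of the coded observations, and
`hidden_value` is still the number that was stored before the element
was masked, which is the behaviour debated above.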

There is actually no reason why you cannot create masked arrays or
views in R - it is a language, after all, with the necessary features.
So I do not see any semantic issue; rather, it has not been implemented
because there is insufficient interest. A likely reason is R's very
strong statistical heritage, so it has been up to the user to make
their data fit the existing statistical routines.

Really, all this discussion just further highlights the value of masked
arrays, as they extend what missing-value coding can do. At present the
cost is memory; performance is a wait-and-see, as there may not be a
difference, because in both cases the code has to know whether an
element is masked or missing and then how to deal with it.
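
To make the memory cost concrete, here is a small sketch, again with
the existing numpy.ma module rather than the proposed NA support, that
compares a NaN-coded array with its masked equivalent. It assumes
float64 data and a one-byte-per-element boolean mask; it says nothing
about relative speed.

```python
import numpy as np
import numpy.ma as ma

# NaN used as the missing-value code inside the data itself.
values = np.array([1.0, np.nan, 3.0, 4.0])

# Equivalent masked array: the mask marks the invalid (NaN) entry.
masked = ma.masked_invalid(values)
mask = ma.getmaskarray(masked)

# The mask is extra storage on top of the data.
data_bytes = values.nbytes
mask_bytes = mask.nbytes

# Either way, the reduction has to skip the missing element.
nan_mean = np.nanmean(values)
masked_mean = masked.mean()
```

Both means agree; the difference is that the masked version carries
`mask_bytes` of additional storage alongside the `data_bytes` of data.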

Bruce