[Numpy-discussion] missing data discussion round 2

Wed Jun 29 14:07:45 EDT 2011

On 06/29/2011 07:38 PM, Mark Wiebe wrote:
> On Wed, Jun 29, 2011 at 9:35 AM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no> wrote:
>
>     On 06/29/2011 03:45 PM, Matthew Brett wrote:
>      > Hi,
>      >
>      > On Wed, Jun 29, 2011 at 12:39 AM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>      >> On Tue, Jun 28, 2011 at 5:20 PM, Matthew Brett <matthew.brett at gmail.com>
>      >> wrote:
>      >>>
>      >>> Hi,
>      >>>
>      >>> On Tue, Jun 28, 2011 at 4:06 PM, Nathaniel Smith <njs at pobox.com> wrote:
>      >>> ...
>      >>>> (You might think, what difference does it make if you *can* unmask an
>      >>>> item? Us missing data folks could just ignore this feature. But:
>      >>>> whatever we end up implementing is something that I will have to
>      >>>> explain over and over to different people, most of them not
>      >>>> particularly sophisticated programmers. And there's just no sensible
>      >>>> way to explain this idea that if you store some particular value, then
>      >>>> it replaces the old value, but if you store NA, then the old value is
>      >>>> still there.
>      >>>
>      >>> Ouch - yes.  No question, that is difficult to explain.   Well, I
>      >>> think the explanation might go like this:
>      >>>
>      >>> "Ah, yes, well, that's because in fact numpy records missing values by
>      >>> using a 'mask'.   So when you say `a[3] = np.NA`, what you mean is,
>      >>> `a._mask = np.ones(a.shape, np.dtype(bool)); a._mask[3] = False`"
>      >>>
>      >>> Is that fair?
>      >>
>      >> My favorite way of explaining it would be to have a grid of numbers written
>      >> on paper, then have several cardboards with holes poked in them in different
>      >> configurations. Placing these cardboard masks in front of the grid would
>      >> show different sets of non-missing data, without affecting the values stored
>      >> on the paper behind them.
>      >
>      > Right - but here of course you are trying to explain the mask, and
>      > this is Nathaniel's point, that in order to explain NAs, you have to
>      > explain masks, and so, even at a basic level, the fusion of the two
>      > ideas is obvious, and already confusing.  I mean this:
>      >
>      > a[3] = np.NA
>      >
>      > "Oh, so you just set the a[3] value to have some missing value code?"
>      >
>      > "Ah - no - in fact what I did was set an associated mask in position
>      > a[3] so that you can't any longer see the previous value of a[3]"
>      >
>      > "Huh.  You mean I have a mask for every single value in order to be
>      > able to blank out a[3]?  It looks like an assignment.  I mean, it
>      > looks just like a[3] = 4.  But I guess it isn't?"
>      >
>      > "Er..."
>      >
>      > I think Nathaniel's point is a very good one - these are separate
>      > ideas, np.NA and np.IGNORE, and a joint implementation is bound to
>      > draw them together in the mind of the user.    Apart from anything
>      > else, the user has to know that, if they want a single NA value in an
>      > array, they have to add a mask size array.shape in bytes.  They have
>      > to know then, that NA is implemented by masking, and then the 'NA for
>      > free by adding masking' idea breaks down and starts to feel like a
>      > kludge.
>      >
>      > The counter argument is of course that, in time, the implementation of
>      > NA with masking will seem as obvious and intuitive, as, say,
>      > broadcasting, and that we are just reacting from lack of experience
>      > with the new API.
>
>     However, no matter how used we get to this, people coming from almost
>     any other tool (in particular R) will keep thinking it is
>     counter-intuitive. Why set up a major semantic incompatibility that
>     people then have to overcome in order to start using NumPy?
>
>
> I'm not aware of a semantic incompatibility. I believe R doesn't support
> views like NumPy does, so the things you have to do to see masking
> semantics aren't even possible in R.

Well, whether the same feature is possible or not in R is irrelevant to
whether a semantic incompatibility would exist.

Views themselves are a *major* semantic incompatibility, and are highly
confusing at first to MATLAB/Fortran/R people. However, they have major
advantages that outweigh the disadvantage of having to caution new users.
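
To make the view behaviour concrete, here is a minimal sketch using plain
NumPy slicing (the array contents and variable names are only illustrative):

```python
import numpy as np

a = np.arange(5)        # array([0, 1, 2, 3, 4])
b = a[1:4]              # basic slicing returns a view, not a copy

b[0] = 100              # writing through the view changes a[1] as well
shares = np.shares_memory(a, b)

c = a[1:4].copy()       # an explicit copy is independent of a
c[1] = -1               # a[2] is left untouched
```

Someone used to copy-on-assignment semantics would probably expect `a` to be
unchanged after `b[0] = 100`; in NumPy it is not.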

But there's simply no precedent anywhere for an assignment that doesn't
erase the old value for a particular input value, and the advantages
seem pretty minor (well, I think it is ugly in its own right, but that
is beside the point...)
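
A rough sketch of the semantics in question, using the existing `numpy.ma`
module as a stand-in (this is an assumption for illustration; it is not the
proposed `np.NA` API, whose details are still under discussion):

```python
import numpy as np

# numpy.ma is used here only as a stand-in for a mask-based NA design.
a = np.ma.array([10.0, 20.0, 30.0, 40.0])

a[3] = 99.0             # ordinary assignment: the old value 40.0 is erased
a[3] = np.ma.masked     # "missing" assignment: only the mask is changed

hidden = float(np.ma.getdata(a)[3])          # value still stored underneath
is_masked = bool(np.ma.getmaskarray(a)[3])   # element is reported as missing

a.mask = False          # clearing the mask exposes the stored value again
restored = float(a[3])
```

So storing 99.0 replaces 40.0, but storing the masked marker leaves 99.0 in
place, which is exactly the asymmetry that is hard to explain.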

Dag Sverre