[Numpy-discussion] Missing data again

Nathaniel Smith njs at pobox.com
Thu Mar 15 20:27:32 EDT 2012


Hi Chuck,

I think I let my frustration get the better of me, and the message
below is too confrontational. I apologize.

I truly would like to understand where you're coming from on this,
though, so I'll try to make this more productive. My summary of points
that no-one has disagreed with yet is here:
  https://github.com/njsmith/numpy/wiki/NA-discussion-status
Of course, this means that there's lots that's left out. Instead of
getting into all those contentious details, I'll stick to just a few
basic questions that might let us get at least a bit of common
ground:
1) Do you disagree with anything that is stated there?
2) Do you feel like that document accurately summarises your basic
idea of what this feature is supposed to do (I assume under the
IGNORED heading)?

Thanks,
-- Nathaniel

On Wed, Mar 7, 2012 at 11:10 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Wed, Mar 7, 2012 at 7:37 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>>
>>
>> On Wed, Mar 7, 2012 at 12:26 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>> When it comes to "missing data", bitpatterns can do everything that
>>> masks can do, are no more complicated to implement, and have better
>>> performance characteristics.
>>>
>>
>> Maybe for float, for other things, no. And we have lots of other things.
>
> It would be easier to discuss this if you'd, like, discuss :-(. If you
> know of some advantage that masks have over bitpatterns when it comes
> to missing data, can you please share it, instead of just asserting
> it?
>
> Not that I'm immune... I perhaps should have been more explicit
> myself. When I said "performance characteristics", I was thinking of
> both speed (for floats) and memory (for most-but-not-all things).
>
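
[A rough illustration of the memory point above, as a sketch only. It
uses plain standard-library buffers rather than NumPy, treats any NaN
as the missing marker (an assumption; a real NA bitpattern would
reserve one specific NaN payload), and assumes a one-byte-per-element
mask.]

```python
import math
from array import array

values = [1.5, None, 2.5, None, 4.0]  # None marks a missing observation

# Bitpattern style: missingness is stored in the data itself (NaN stands in for NA).
bitpattern = array("d", [math.nan if v is None else v for v in values])

# Mask style: a separate one-byte-per-element buffer records the missing slots.
data = array("d", [0.0 if v is None else v for v in values])
mask = bytearray(v is None for v in values)

# Storage cost of each representation, in bytes.
bitpattern_bytes = bitpattern.itemsize * len(bitpattern)
masked_bytes = data.itemsize * len(data) + len(mask)

# Both representations give the same answer when missing values are skipped.
total_bitpattern = sum(x for x in bitpattern if not math.isnan(x))
total_masked = sum(x for x, m in zip(data, mask) if not m)

print(bitpattern_bytes, masked_bytes, total_bitpattern, total_masked)
```

[For 8-byte floats the mask adds one byte per element; for 1-byte
integer data the same mask would double the storage.]
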
>> The
>> performance is a strawman,
>
> How many users need to speak up to say that this is a serious problem
> they have with the current implementation before you stop calling it a
> strawman? Because when Wes says that it's not going to fly for his
> stats/econometrics cases, and the neuroimaging folk like Gary and Matt
> say it's not going to fly for their use cases... surely just waving
> that away is a bit dismissive?
>
> I'm not saying that we *have* to implement bitpatterns because
> performance is *the most important feature* -- I'm just saying, well,
> what I said. For *missing data* use cases, bitpatterns have better
> performance characteristics than masks. If we decide that these use
> cases are important, then we should take this into account and weigh
> it against other considerations. Maybe what you think is that these
> use cases shouldn't be the focus of this feature and it should focus
> on the "ignored" use cases instead? That would be a legitimate
> argument... but if that's what you want to say, say it, don't just
> dismiss your users!
>
>> and it *isn't* easier to implement.
>
> If I thought bitpatterns would be easier to implement, I would have
> said so... What I said was that they're not harder. You have some
> extra complexity, mostly in casting, and some reduced complexity -- no
> need to allocate and manipulate the mask. (E.g., simple same-type
> assignments and slicing require special casing for masks, but not for
> bitpatterns.) In many places the complexity is identical -- printing
> routines need to check for either special bitpatterns or masked
> values, whatever. Ufunc loops need to either find the appropriate part
> of the mask, or create a temporary mask buffer by calling a dtype
> func, whatever. On net they seem about equivalent, complexity-wise.
>
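
[To make the slicing remark above concrete, here is a minimal sketch.
It is not NumPy's C-level implementation; the buffers are plain
standard-library objects, and NaN standing in for the NA bitpattern is
an assumption.]

```python
import math
from array import array

# Bitpattern style: one buffer, with NaN marking the missing slots.
bitpattern = array("d", [1.0, math.nan, 3.0, math.nan])

# Mask style: a data buffer plus a parallel mask buffer (1 = missing).
data = array("d", [1.0, 0.0, 3.0, 0.0])
mask = bytearray([0, 1, 0, 1])

# Slicing the bitpattern buffer needs no special handling:
# the missing markers travel with the data.
bp_slice = bitpattern[1:3]

# Slicing the masked pair means slicing both buffers and keeping them in sync.
data_slice, mask_slice = data[1:3], mask[1:3]

bp_missing = [math.isnan(x) for x in bp_slice]
mask_missing = [bool(m) for m in mask_slice]
print(bp_missing, mask_missing)
```
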
> ...I assume you disagree with this analysis, since I've said it
> before, wrote up a sketch for how the implementation would work at the
> C level, etc., and you continue to claim that simplicity is a
> compelling advantage for the masked approach. But I still don't know
> why you think that :-(.
>
>>> > Also, different folks adopt different values
>>> > for 'missing' data, and distributing one or several masks along with the
>>> > data is another common practice.
>>>
>>> True, but not really relevant to the current debate, because you have
>>> to handle such issues as part of your general data import workflow
>>> anyway, and none of these is any more complicated no matter which
>>> implementations are available.
>>>
>>> > One inconvenience I have run into with the current API is that it should
>>> > be
>>> > easier to clear the mask from an "ignored" value without taking a new
>>> > view
>>> > or assigning known data. So maybe two types of masks (different
>>> > payloads),
>>> > or an additional flag could be helpful. The process of assigning masks
>>> > could
>>> > also be made a bit easier than using fancy indexing.
>>>
>>> So this, uh... this was actually the whole goal of the "alterNEP"
>>> design for masks -- making all this stuff easy for people (like you,
>>> apparently?) that want support for ignored values, separately from
>>> missing data, and want a nice clean API for it. Basically having a
>>> separate .mask attribute which was an ordinary, assignable array
>>> broadcastable to the attached array's shape. Nobody seemed interested
>>> in talking about it much then but maybe there's interest now?
>>>
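
[A toy sketch of the "separate, assignable .mask" idea described above.
The class name IgnoredArray and its visible() method are hypothetical
and are not part of NumPy or of the alterNEP text. Broadcasting of the
mask is left out to keep the example short.]

```python
class IgnoredArray:
    """Hypothetical container: data plus an ordinary, assignable .mask."""

    def __init__(self, data):
        self.data = list(data)
        self.mask = [False] * len(self.data)  # True means "ignored"

    def visible(self):
        # Return only the elements that are not currently ignored.
        return [x for x, m in zip(self.data, self.mask) if not m]


a = IgnoredArray([10, 20, 30])

a.mask[1] = True        # ignore one element; the stored value 20 is untouched
hidden = a.visible()

a.mask[1] = False       # clear the mask without a new view or assigning data
restored = a.visible()

print(hidden, restored)
```

[Because the mask is ordinary data, un-ignoring a value is a plain
assignment and the original payload is still there afterwards.]
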
>>
>> Come off it, Nathaniel, the problem is minor and fixable. The intent of the
>> initial implementation was to discover such things.
>
> Implementation can be wonderful, I absolutely agree. But you
> understand that I'd be more impressed by this example if your
> discovery weren't something I had been arguing for since before the
> implementation began :-).
>
>> These things are less
>> accessible with the current API *precisely* because of the feedback from R
>> users. It didn't start that way.
>>
>> We now have something to evolve into what we want. That is a heck of a lot
>> more useful than endless discussion.
>
> No, you are still missing the point completely! There is no "what *we*
> want", because what you want is different than what I want. The
> masking stuff in the alterNEP was an attempt to give people like you
> who wanted "ignored" support what they wanted, and the bitpattern
> stuff was to satisfy people like me who want "missing data" support.
> The NEP took a different approach to trying to make everyone happy...
> unfortunately it sounds like it made no-one happy. Blaming the R users
> for this isn't *wrong*, exactly, but it's a bit one-sided.
>
> If you have a proposal for how the current code can be "evolved" into
> something that will make the neuro/econ/stats people happy, then
> please tell us. But I don't see how it's possible, and your current
> proposals are going in the wrong direction. Unless we can actually
> talk about these disagreements, we're just going to have more endless
> discussion.
>
> -- Nathaniel


