[Numpy-discussion] alterNEP - was: missing data discussion round 2

Fri Jul 1 12:18:40 EDT 2011

On 07/01/2011 10:15 AM, Nathaniel Smith wrote:
> On Fri, Jul 1, 2011 at 7:09 AM, Mark Wiebe<mwwiebe at gmail.com>  wrote:
>> On Fri, Jul 1, 2011 at 6:58 AM, Matthew Brett<matthew.brett at gmail.com>
>> wrote:
>>> Do you see problems with the alterNEP proposal?
>> Yes, I really like my design as it stands now, and the alterNEP removes a
>> lot of the abstraction and interoperability that are in my opinion the best
>> parts. I've made more updates to the NEP based on continuing feedback, which
>> are part of the pull request I want reviews for.
>>
>>> If so, what are they?
>> Mainly: Reduced interoperability, more complex implementation (leading to
>> more bugs), and an unclear theoretical model for the masked part of it.
> Can you give any examples of situations where one would run into this
> "reduced interoperability"? I'm not sure what it means. The only
> person who has so far spoken up as needing both masking semantics and
> NA semantics -- Gary Strangman -- has said that he strongly prefers
> the alterNEP semantics *exactly because* it makes it clear *how these
> functions will interoperate.*
>
> Can you give any examples of how the implementation would be more
> complicated? As far as I can tell there are no elements in the
> alterNEP that are not in your NEP, they mostly just expose the
> functionality differently at the top level.
>
> Do you have a clearer theoretical model for the masked part of your
> proposal? The best I've been able to extract from any of your messages
> is when you wrote "it seems to me that people wanting masked arrays
> want missing data without touching their data". But as a matter of
> English grammar, I have no idea what this means -- if you have data,
> it's not missing! It seems to me that people wanting masked data want
> to *hide* parts of their data, which seems much clearer to me and is
> the theoretical model used in the alterNEP. Note that this model
> actually predicts several of the differences between how people want
> masks to work and how people want NAs to work (e.g., their behavior
> during reduction); I
>
>>> Do you agree that the alterNEP proposal is easier to understand?
>> No.
>>> If not, can you explain why?
>> My answers to that are already scattered in the emails in various places,
>> and in the various rationales and justifications provided in the NEP.
> I understand the desire not to get caught up in spending all your time
> writing emails explaining things that you feel like you've already
> explained.
>
> Maybe there's an email I missed somewhere where you explain the
> conceptual model behind your NEP's semantics in a short,
> easy-to-understand way (comparable to, say, the Rationale section of
> the alterNEP). But I haven't seen it and I can't reconstruct a
> rationale for it myself (the alterNEP comes out of my attempts to do
> so!).
>
>>> What do you see as the important points of difference between the NEP
>>> and the alterNEP?
>> The biggest thing is the NEP supports more use cases in a clean way by
>> composition of different simpler components. It defines one clear missing
>> data abstraction, and proposes two implementations that are interchangeable
>> and can interoperate.
> But the two implementations in your proposal are not interchangeable!
> The whole justification for starting with a masked-based
> implementation in your proposal is that it supports unmasking via
> views; if that requirement were removed, then there would be no reason
> to bother with the masking-based implementation at all.
>
> Well, that's not true. There are some marginal advantages in the
> special case of working with integers+NAs. But I don't think anyone's
> making that argument.
>
>> The alterNEP proposes two independent APIs, reducing
>> interoperability and so significantly increasing the amount of learning
>> required to work with both of them. This also precludes switching between
>> the two approaches without a lot of work.
> You can't switch between Python and C without a lot of work too, but
> that doesn't mean that they should be merged into one design... but
> they do complement each other beautifully. Just like missing data and
> masked arrays :-).
>
>> The current pull request that's sitting there waiting for review does not
>> have an impact on which approach goes ahead, but the code I'm doing now
>> does. This is a fairly large project, and I don't have a great length of
>> time to do it in, so I'm not going to participate extensively in the
>> alterNEP discussion. If you want to help me, please review my code and
>> provide specific feedback on my NEP (the code review system in github is
>> great for this too, I've received some excellent feedback on the NEP that
>> way). If you want to change my mind about things, please address the
>> specific design decisions you think are problematic by specifically
>> responding to lines in the NEP, as part of code-reviewing my pull request in
>> github.
> I know I'm being grumpy in this email, and I apologize for that. But,
> no. I've given extensive feedback, read the list carefully, and
> thought hard about these issues, and so far you've basically just
> dismissed my concerns. (See, e.g., [1], where your response to "we
> have to choose whether it's possible to recover data after it has been
> masked/NAed/whatever" is "no we don't, it should be both possible and
> impossible", which, I mean, what?) I've done my best to express them
> clearly, in the best way I know how -- and that way is *not* line by
> line comments on your NEP, because my concerns are more fundamental
> than that.
>
> I am of course happy to answer questions and such if there are places
> where I've been unclear.
>
> And of course it's your prerogative to decide how you want to spend
> your time (well, yours and your employer's, I guess), which forums you
> want to participate in, what code you want to write, etc. If you have
> decided that you are tired of talking about this and want to just go
> off and implement something, then good luck (and I do mean that, it
> isn't sarcasm).
>
> But as far as I can tell right now, every single person who has
> experience with handling missing data for statistical purposes (esp.
> in R) has real concerns about your proposal, and AFAICT the community
> has very much *not* reached consensus on how these features should
> look. So I guess my question is, once you've spent your limited time
> on writing this code -- how confident are you that it will be merged?
> This isn't a threat or anything, I have no power over what gets
> merged, but -- it seems to me that there's a real chance that you'll
> do this work and then it will go down in flames, or that it will be
> merged and then the people you're trying to target will ignore it
> anyway. This is why we try to build consensus first, right? I would
> love to find some way to make everyone happy (and have been doing what
> I can on that front), but right now I am not happy, other people are
> not happy, and you're communicating that you don't think that matters.
> I'd love for that to change.
>
> -- Nathaniel
>
> [1] http://mail.scipy.org/pipermail/numpy-discussion/2011-June/057274.html
I am sorry, but that is NOT true - DON'T just lump everyone into this
when they have clearly stated the opposite! Missing values are nothing
special to me, just reality. There are many statistical applications
where masking is extremely common, like outlier detection and flagging
unusual observations (handling missing values is also a form of
masking). It is just that you as a user have to do that yourself by
creating and maintaining working variables.
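
As a rough illustration of what "working variables" means here, the
sketch below flags an outlier with a hand-maintained boolean mask and
then shows the same thing with numpy.ma. The data values and the
5 * MAD cutoff are made up for the example, not a recommendation.

```python
import numpy as np

# Hypothetical measurements; the last value is an obvious outlier.
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 55.0])

# Flag observations far from the median, scaled by the median
# absolute deviation (MAD). The factor 5 is an arbitrary choice.
median = np.median(data)
mad = np.median(np.abs(data - median))
is_outlier = np.abs(data - median) > 5 * mad

# The mask is a separate working variable the user has to carry around.
clean_mean = data[~is_outlier].mean()

# numpy.ma keeps data and mask together; the flagged value is hidden
# from the reduction but still present in the underlying array.
masked = np.ma.masked_array(data, mask=is_outlier)
masked_mean = masked.mean()
```

In both cases the original observations are left untouched; only the
mask decides what takes part in the calculation.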

I really find that you are 'splitting hairs' in your arguments, as it
really has to be up to the application how missing values and NaN
are handled. I see no difference between a missing value and a
NaN because in virtually all statistical applications, both of these are
dropped. This is what SAS typically does, although certain procedures like
FREQ allow you to treat missing values as 'valid'. R has slightly more
flexibility since it differentiates missing values and NaN. R allows you
to decide how missing values are handled using arguments like na.rm or
the na.fail, na.omit, na.exclude and na.pass functions. But I think that
in the majority of cases (I'm not an R guru) R acts the same way since,
by default (which is how most people use R), R excludes missing values
and NaNs.
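
For comparison, a minimal sketch of the "drop them" behaviour with what
numpy already offers for floating point data. The array is invented for
the example; np.nanmean plays a role loosely similar to passing
na.rm=TRUE in R.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])

# A plain reduction propagates NaN.
propagated = np.mean(x)

# nanmean ignores NaN entries, similar in spirit to na.rm=TRUE.
skipped = np.nanmean(x)

# Dropping the NaN entries by hand with a boolean mask is equivalent.
dropped = x[~np.isnan(x)].mean()
```

Which of these is right depends on the application, which is the point
made above.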

One of the problems I see here is that numpy has to work with a wide
range of situations that neither R nor SAS nor any other
statistics-based language/application has to deal with. So whatever you
suggest has to work for string, integer and date/time arrays.
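
To make the dtype problem concrete, here is a small sketch using only
what numpy has today, with made-up values. Integer and string arrays
have no NaN to fall back on, so this example uses a separate mask via
numpy.ma; datetime64 has its own NaT marker instead.

```python
import numpy as np

# Integer arrays cannot store NaN, so a mask marks the missing entry.
counts = np.ma.masked_array([3, 7, 0, 5], mask=[False, False, True, False])
total = counts.sum()  # the masked entry is skipped

# String arrays have no built-in missing marker either.
names = np.ma.masked_array(np.array(["a", "b", "c"]),
                           mask=[False, True, False])
visible = names.compressed()  # only the unmasked entries

# datetime64 arrays carry a dedicated not-a-time value.
dates = np.array(["2011-07-01", "NaT"], dtype="datetime64[D]")
missing_dates = np.isnat(dates)
```

None of this is the design proposed in either NEP; it only shows why a
float-only NaN convention does not cover these cases.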

I generally agree with what Chuck has said. But I know that while we
have little say in some of numpy, we can file tickets that actually get
some action. Times also change: this missing value topic has drawn way
more interest than the previous times it was raised. So I think we
are gradually getting some positive awareness.

Bruce