[Numpy-discussion] consensus (was: NA masks in the next numpy release?)

Sat Oct 29 14:14:54 EDT 2011

On Fri, Oct 28, 2011 at 9:32 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Fri, Oct 28, 2011 at 6:45 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>> On Fri, Oct 28, 2011 at 7:53 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> >
>> >
>> > On Friday, October 28, 2011, Matthew Brett <matthew.brett at gmail.com>
>> > wrote:
>> >> Hi,
>> >>
>> >> On Fri, Oct 28, 2011 at 4:21 PM, Ralf Gommers
>> >> <ralf.gommers at googlemail.com> wrote:
>> >>>
>> >>>
>> >>> On Sat, Oct 29, 2011 at 12:37 AM, Matthew Brett
>> >>> <matthew.brett at gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> On Fri, Oct 28, 2011 at 3:14 PM, Charles R Harris
>> >>>> <charlesr.harris at gmail.com> wrote:
>> >>>> >
>> >>>> >
>> >>>> > On Fri, Oct 28, 2011 at 3:56 PM, Matthew Brett
>> >>>> > <matthew.brett at gmail.com>
>> >>>> > wrote:
>> >>>> >>
>> >>>> >> Hi,
>> >>>> >>
>> >>>> >> On Fri, Oct 28, 2011 at 2:43 PM, Matthew Brett
>> >>>> >> <matthew.brett at gmail.com>
>> >>>> >> wrote:
>> >>>> >> > Hi,
>> >>>> >> >
>> >>>> >> > On Fri, Oct 28, 2011 at 2:41 PM, Charles R Harris
>> >>>> >> > <charlesr.harris at gmail.com> wrote:
>> >>>> >> >>
>> >>>> >> >>
>> >>>> >> >> On Fri, Oct 28, 2011 at 3:16 PM, Nathaniel Smith
>> >>>> >> >> <njs at pobox.com>
>> >>>> >> >> wrote:
>> >>>> >> >>>
>> >>>> >> >>> On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant
>> >>>> >> >>> <oliphant at enthought.com>
>> >>>> >> >>> wrote:
>> >>>> >> >>> > I think Nathaniel and Matthew provided very specific
>> >>>> >> >>> > feedback that was helpful in understanding other
>> >>>> >> >>> > perspectives of a difficult problem.  In particular, I
>> >>>> >> >>> > really wanted bit-patterns implemented.  However, I also
>> >>>> >> >>> > understand that Mark did quite a bit of work and altered
>> >>>> >> >>> > his original designs quite a bit in response to community
>> >>>> >> >>> > feedback.  I wasn't a major part of the pull request
>> >>>> >> >>> > discussion, nor did I merge the changes, but I support
>> >>>> >> >>> > Charles if he reviewed the code and felt like it was the
>> >>>> >> >>> > right thing to do.  I likely would have done the same
>> >>>> >> >>> > thing rather than let Mark Wiebe's work languish.
>> >>>> >> >>>
>> >>>> >> >>> My connectivity is spotty this week, so I'll stay out of the
>> >>>> >> >>> technical discussion for now, but I want to share a story.
>> >>>> >> >>>
>> >>>> >> >>> Maybe a year ago now, Jonathan Taylor and I were debating what
>> >>>> >> >>> the best API for describing statistical models would be --
>> >>>> >> >>> whether we wanted something like R's "formulas" (which I
>> >>>> >> >>> supported), or another approach based on sympy (his idea). To
>> >>>> >> >>> summarize, I thought his API was confusing, pointlessly
>> >>>> >> >>> complicated, and didn't actually solve the problem; he thought
>> >>>> >> >>> R-style formulas were superficially simpler but hopelessly
>> >>>> >> >>> confused and inconsistent underneath. Now, obviously, I was
>> >>>> >> >>> right and he was wrong. Well, obvious to me, anyway... ;-) But
>> >>>> >> >>> it wasn't like I could just wave a wand and make his arguments
>> >>>> >> >>> go away,
>> >>>> >> >>> no
>> >> I should point out that the implementation hasn't - as far as I can
>> >> see - changed the discussion.  The discussion was about the API.
>> >> Implementations are useful for agreed APIs because they can point out
>> >> where the API does not make sense or cannot be implemented.  In this
>> >> case, the API Mark said he was going to implement - he did implement -
>> >> at least as far as I can see.  Again, I'm happy to be corrected.
>> >>
>> >>>> In saying that we are insisting on our way, you are saying,
>> >>>> implicitly, 'I am not going to negotiate'.
>> >>>
>> >>> That is only your interpretation. The observation that Mark
>> >>> compromised quite a bit while you didn't seems largely correct to me.
>> >>
>> >> The problem here stems from our inability to work towards agreement,
>> >> rather than standing on set positions.  I set out what changes I think
>> >> would make the current implementation OK.  Can we please, please have
>> >> a discussion about those points instead of trying to argue about who
>> >> has given more ground.
>> >>
>> >>> That commitment would of course be good. However, even if that were
>> >>> possible before writing code and everyone agreed that the ideas of
>> >>> you and Nathaniel should be implemented in full, it's still not
>> >>> clear that either of you would be willing to write any code.
>> >>> Agreement without code still doesn't help us very much.
>> >>
>> >> I'm going to return to Nathaniel's point - it is a highly valuable
>> >> thing to set ourselves the target of resolving substantial discussions
>> >> by consensus.   The route you are endorsing here is 'implementor
>> >> wins'.   We don't need to do it that way.  We're a mature sensible
>> >> bunch of adults who can talk out the issues until we agree they are
>> >> ready for implementation, and then implement.  That's all Nathaniel is
>> >> saying.  I think he's obviously right, and I'm sad that it isn't as
>> >> clear to y'all as it is to me.
>> >>
>> >> Best,
>> >>
>> >> Matthew
>> >>
>> >
>> > Everyone, can we please not do this?! I had enough of adults doing
>> > finger pointing back over the summer during the whole debt ceiling
>> > debate.  I think we can all agree that we are better than the US
>> > congress?
>> >
>> > Forget about rudeness or decision processes.
>> >
>> > I will start by saying that I am willing to separate ignore and
>> > absent, but only on the write side of things.  On read, I want a
>> > single way to identify the missing values.  I also want only a single
>> > way to perform calculations (either skip or propagate).
>> >
>> > An indicator of success would be that people stop using NaNs and magic
>> > numbers (-9999, anyone?) and we could even deprecate nansum(), or at
>> > least strongly suggest in its docs to use NA.
>>
>> Well, I haven't completely made up my mind yet, will have to do some
>> more prototyping and playing (and potentially have some of my users
>> eat the differently-flavored dogfood), but I'm really not very
>> satisfied with the API at the moment. I'm mainly worried about the
>> abstraction leaking through to pandas users (this is a pretty large
>> group of people judging by # of downloads).
>>
>> The basic position I'm in is that I'm trying to push Python into a new
>> space, namely mainstream data analysis and statistical computing, one
>> that is solidly occupied by R and other such well-known players. My
>> target users are not computer scientists. They are not going to invest
>> in understanding dtypes very deeply or the internals of ndarray. In
>> fact I've spent a great deal of effort making it so that pandas users
>> can be productive and successful while having very little
>> understanding of NumPy. Yes, I essentially "protect" my users from
>> NumPy because using it well requires a certain level of sophistication
>> that I think is unfair to demand of people. This might seem totally
>> bizarre to some of you but it is simply the state of affairs. So far I
>> have been successful because more people are using Python and pandas
>> to do things that they used to do in R. The NA concept in R is dead
>> simple and I don't see why we are incapable of also implementing
>> something that is just as dead simple. To us, the scipy elite let's
>> call us, it seems simple: "oh, just pass an extra flag to all my array
>> constructors!" But this along with the masked array concept is going
>> to have two likely outcomes:
>>
>> 1) Create a great deal more complication in my already very large codebase
>>
>> and/or
>>
>> 2) force pandas users to understand the new masked arrays after I've
>> carefully made it so they can be largely ignorant of NumPy
>>
>> The mostly-NaN-based solution I've cobbled together and tweaked over
>> the last 42 months actually *works really well*, amazingly, with
>> relatively little cost in code complexity. Having found a reasonably
>> stable equilibrium I'm extremely resistant to upsetting the balance.
>>
>> So I don't know. After watching these threads bounce back and forth
>> I'm frankly not all that hopeful about a solution arising that
>> actually addresses my needs.
>
> But Wes, what *are* your needs? You keep saying this, but we need examples
> of how you want to operate and how numpy fails. As to dtypes, internals, and
> all that, I don't see any of that in the current implementation, unless you
> mean the maskna and skipna keywords. I believe someone on the previous
> thread mentioned a way to deal with that.
>
> Chuck
>

Here are my needs:

1) How NAs are implemented cannot be end user visible. Having to pass
maskna=True is a problem. I suppose a solution is to set the flag to
true on every array inside of pandas so the user never knows (you
mentioned someone else had some other solution, I could go back and
dig it up?)

2) Performance: I can't accept more than say 2x overhead in floating
point array operations (binary ops or reductions). Last time I checked
we were a long way away from that.

3) Implementation of NA-aware algorithms in Cython. A lot of pandas is
about moving data around. Bit patterns would make life a lot easier
because the code wouldn't have to change (much). But with masked
arrays I'll have to move both data and mask values. Not the end of the
world, but it's just the price you pay, I guess.
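
To make point 3 concrete, here is a minimal sketch in plain NumPy (not
pandas' actual code; `reorder_nan` and `reorder_masked` are hypothetical
names used only for illustration). With an in-band marker such as NaN,
the "missing" flag travels with the value, so data-moving code stays as
it is. With a separate mask, every move has to be applied to both arrays.

```python
import numpy as np


def reorder_nan(values, indexer):
    """Move NaN-marked data: the missing marker travels with the values."""
    return values.take(indexer)


def reorder_masked(values, mask, indexer):
    """Move mask-marked data: values and mask must be moved in lockstep."""
    return values.take(indexer), mask.take(indexer)


indexer = np.array([2, 0, 1])

# In-band representation: NaN marks the missing entry.
nan_data = np.array([1.0, np.nan, 3.0])
moved = reorder_nan(nan_data, indexer)

# Out-of-band representation: a boolean mask (True = missing) sits next
# to the raw values, and the slot under the mask holds an arbitrary number.
raw = np.array([1.0, 0.0, 3.0])
mask = np.array([False, True, False])
moved_raw, moved_mask = reorder_masked(raw, mask, indexer)
```

In both cases the missing entry ends up in the last position; the
difference is only in how many arrays the moving code has to touch.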

Things in R are a bit simpler re: bit patterns because there's only
double, integer, string (character), and boolean dtypes, whereas NumPy
has the whole C type hierarchy. So I can appreciate that doing bit
patterns across all the dtypes would be really hard.
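
A rough sketch of why the dtype question matters, under stated
assumptions: NaN only exists for floating-point dtypes, so an integer
array has no spare in-band value unless one is reserved. `INT32_NA`
below is a hypothetical sentinel chosen for illustration (the smallest
32-bit integer, similar in spirit to what R does for integer NA); it is
not part of NumPy.

```python
import numpy as np

# Hypothetical sentinel for "missing" in an int32 array (an assumption
# made for this sketch, not a NumPy feature).
INT32_NA = np.iinfo(np.int32).min

ints = np.array([10, INT32_NA, 30], dtype=np.int32)
is_na = ints == INT32_NA

# Forgetting about the sentinel silently corrupts the result.
naive_total = int(ints.sum())

# NaN-based workaround: upcast to float64 so NaN can mark the gap,
# then use a NaN-aware reduction.
as_float = ints.astype(np.float64)
as_float[is_na] = np.nan
total = float(np.nansum(as_float))

# Only floating-point dtypes can hold NaN at all.
has_nan = {
    name: np.issubdtype(np.dtype(name), np.floating)
    for name in ("int8", "int32", "uint16", "float32", "float64", "bool")
}
```

The upcast keeps the arithmetic correct at the cost of changing the
dtype, which is the kind of trade-off a per-dtype bit pattern would
avoid.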

In any case, I recognize that the current implementation will be
useful to a lot of people, but it may not meet my performance and
usability requirements. As I said, the solution I've cooked up has
worked well so far, and since it isn't a major pain point I may just
adopt the "ain't broke, don't fix" attitude and focus my efforts on
building new features. "Practicality beats purity", I suppose.

- W

> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>