[Numpy-discussion] consensus (was: NA masks in the next numpy release?)

Sat Oct 29 14:43:41 EDT 2011

On Sat, Oct 29, 2011 at 12:14 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> On Fri, Oct 28, 2011 at 9:32 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> >
> >
> > On Fri, Oct 28, 2011 at 6:45 PM, Wes McKinney <wesmckinn at gmail.com>
> wrote:
> >>
> >> On Fri, Oct 28, 2011 at 7:53 PM, Benjamin Root <ben.root at ou.edu> wrote:
> >> >
> >> >
> >> > On Friday, October 28, 2011, Matthew Brett <matthew.brett at gmail.com>
> >> > wrote:
> >> >> Hi,
> >> >>
> >> >> On Fri, Oct 28, 2011 at 4:21 PM, Ralf Gommers
> >> >> <ralf.gommers at googlemail.com> wrote:
> >> >>>
> >> >>>
> >> >>> On Sat, Oct 29, 2011 at 12:37 AM, Matthew Brett
> >> >>> <matthew.brett at gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> Hi,
> >> >>>>
> >> >>>> On Fri, Oct 28, 2011 at 3:14 PM, Charles R Harris
> >> >>>> <charlesr.harris at gmail.com> wrote:
> >> >>>> >
> >> >>>> >
> >> >>>> > On Fri, Oct 28, 2011 at 3:56 PM, Matthew Brett
> >> >>>> > <matthew.brett at gmail.com>
> >> >>>> > wrote:
> >> >>>> >>
> >> >>>> >> Hi,
> >> >>>> >>
> >> >>>> >> On Fri, Oct 28, 2011 at 2:43 PM, Matthew Brett
> >> >>>> >> <matthew.brett at gmail.com>
> >> >>>> >> wrote:
> >> >>>> >> > Hi,
> >> >>>> >> >
> >> >>>> >> > On Fri, Oct 28, 2011 at 2:41 PM, Charles R Harris
> >> >>>> >> > <charlesr.harris at gmail.com> wrote:
> >> >>>> >> >>
> >> >>>> >> >>
> >> >>>> >> >> On Fri, Oct 28, 2011 at 3:16 PM, Nathaniel Smith
> >> >>>> >> >> <njs at pobox.com>
> >> >>>> >> >> wrote:
> >> >>>> >> >>>
> >> >>>> >> >>> On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant
> >> >>>> >> >>> <oliphant at enthought.com>
> >> >>>> >> >>> wrote:
> >> >>>> >> >>> > I think Nathaniel and Matthew provided very specific feedback
> >> >>>> >> >>> > that was helpful in understanding other perspectives of a
> >> >>>> >> >>> > difficult problem. In particular, I really wanted bit-patterns
> >> >>>> >> >>> > implemented. However, I also understand that Mark did quite a
> >> >>>> >> >>> > bit of work and altered his original designs quite a bit in
> >> >>>> >> >>> > response to community feedback. I wasn't a major part of the
> >> >>>> >> >>> > pull request discussion, nor did I merge the changes, but I
> >> >>>> >> >>> > support Charles if he reviewed the code and felt like it was
> >> >>>> >> >>> > the right thing to do. I likely would have done the same thing
> >> >>>> >> >>> > rather than let Mark Wiebe's work languish.
> >> >>>> >> >>>
> >> >>>> >> >>> My connectivity is spotty this week, so I'll stay out of the
> >> >>>> >> >>> technical discussion for now, but I want to share a story.
> >> >>>> >> >>>
> >> >>>> >> >>> Maybe a year ago now, Jonathan Taylor and I were debating what
> >> >>>> >> >>> the best API for describing statistical models would be --
> >> >>>> >> >>> whether we wanted something like R's "formulas" (which I
> >> >>>> >> >>> supported), or another approach based on sympy (his idea). To
> >> >>>> >> >>> summarize, I thought his API was confusing, pointlessly
> >> >>>> >> >>> complicated, and didn't actually solve the problem; he thought
> >> >>>> >> >>> R-style formulas were superficially simpler but hopelessly
> >> >>>> >> >>> confused and inconsistent underneath. Now, obviously, I was
> >> >>>> >> >>> right and he was wrong. Well, obvious to me, anyway... ;-) But
> >> >>>> >> >>> it wasn't like I could just wave a wand and make his arguments
> >> >>>> >> >>> go away,
> >> >>>> >> >>> no I should point out that the implementation hasn't - as
> far
> >> >>>> >> >>> as
> >> >>>> >> >>> I can
> >> >> see - changed the discussion.  The discussion was about the API.
> >> >> Implementations are useful for agreed APIs because they can point out
> >> >> where the API does not make sense or cannot be implemented.  In this
> >> >> case, the API Mark said he was going to implement - he did implement -
> >> >> at least as far as I can see.  Again, I'm happy to be corrected.
> >> >>
> >> >>>> In saying that we are insisting on our way, you are saying,
> >> >>>> implicitly, 'I am not going to negotiate'.
> >> >>>
> >> >>> That is only your interpretation. The observation that Mark
> >> >>> compromised quite a bit while you didn't seems largely correct to me.
> >> >>
> >> >> The problem here stems from our inability to work towards agreement,
> >> >> rather than standing on set positions.  I set out what changes I think
> >> >> would make the current implementation OK.  Can we please, please have
> >> >> a discussion about those points instead of trying to argue about who
> >> >> has given more ground?
> >> >>
> >> >>> That commitment would of course be good. However, even if that were
> >> >>> possible before writing code and everyone agreed that the ideas of you
> >> >>> and Nathaniel should be implemented in full, it's still not clear that
> >> >>> either of you would be willing to write any code. Agreement without
> >> >>> code still doesn't help us very much.
> >> >>
> >> >> I'm going to return to Nathaniel's point - it is a highly valuable
> >> >> thing to set ourselves the target of resolving substantial discussions
> >> >> by consensus.   The route you are endorsing here is 'implementor
> >> >> wins'.   We don't need to do it that way.  We're a mature sensible
> >> >> bunch of adults who can talk out the issues until we agree they are
> >> >> ready for implementation, and then implement.  That's all Nathaniel is
> >> >> saying.  I think he's obviously right, and I'm sad that it isn't as
> >> >> clear to y'all as it is to me.
> >> >>
> >> >> Best,
> >> >>
> >> >> Matthew
> >> >>
> >> >
> >> > Everyone, can we please not do this?! I had enough of adults doing
> >> > finger pointing back over the summer during the whole debt ceiling
> >> > debate.  I think we can all agree that we are better than the US
> >> > congress?
> >> >
> >> > Forget about rudeness or decision processes.
> >> >
> >> > I will start by saying that I am willing to separate ignore and absent,
> >> > but only on the write side of things.  On read, I want a single way to
> >> > identify the missing values.  I also want only a single way to perform
> >> > calculations (either skip or propagate).
> >> >
> >> > An indicator of success would be that people stop using NaNs and magic
> >> > numbers (-9999, anyone?) and we could even deprecate nansum(), or at
> >> > least strongly suggest in its docs to use NA.
> >>
> >> Well, I haven't completely made up my mind yet, will have to do some
> >> more prototyping and playing (and potentially have some of my users
> >> eat the differently-flavored dogfood), but I'm really not very
> >> satisfied with the API at the moment. I'm mainly worried about the
> >> abstraction leaking through to pandas users (this is a pretty large
> >> group of people judging by # of downloads).
> >>
> >> The basic position I'm in is that I'm trying to push Python into a new
> >> space, namely mainstream data analysis and statistical computing, one
> >> that is solidly occupied by R and other such well-known players. My
> >> target users are not computer scientists. They are not going to invest
> >> in understanding dtypes very deeply or the internals of ndarray. In
> >> fact I've spent a great deal of effort making it so that pandas users
> >> can be productive and successful while having very little
> >> understanding of NumPy. Yes, I essentially "protect" my users from
> >> NumPy because using it well requires a certain level of sophistication
> >> that I think is unfair to demand of people. This might seem totally
> >> bizarre to some of you but it is simply the state of affairs. So far I
> >> have been successful because more people are using Python and pandas
> >> to do things that they used to do in R. The NA concept in R is dead
> >> simple and I don't see why we are incapable of also implementing
> >> something that is just as dead simple. To us, the scipy elite let's
> >> call us, it seems simple: "oh, just pass an extra flag to all my array
> >> constructors!" But this along with the masked array concept is going
> >> to have two likely outcomes:
> >>
> >> 1) Create a great deal more complication in my already very large
> >> codebase
> >>
> >> and/or
> >>
> >> 2) Force pandas users to understand the new masked arrays after I've
> >> carefully made it so they can be largely ignorant of NumPy
> >>
> >> The mostly-NaN-based solution I've cobbled together and tweaked over
> >> the last 42 months actually *works really well*, amazingly, with
> >> relatively little cost in code complexity. Having found a reasonably
> >> stable equilibrium, I'm extremely reluctant to upset the balance.
> >>
> >> So I don't know. After watching these threads bounce back and forth
> >> I'm frankly not all that hopeful about a solution arising that
> >> actually addresses my needs.
> >
> > But Wes, what *are* your needs? You keep saying this, but we need examples
> > of how you want to operate and how numpy fails. As to dtypes, internals,
> > and all that, I don't see any of that in the current implementation, unless
> > you mean the maskna and skipna keywords. I believe someone on the previous
> > thread mentioned a way to deal with that.
> >
> > Chuck
> >
>
> Here are my needs:
>
> 1) How NAs are implemented cannot be end user visible. Having to pass
> maskna=True is a problem. I suppose a solution is to set the flag to
> true on every array inside of pandas so the user never knows (you
> mentioned someone else had some other solution; I could go back and
> dig it up?).
>

I believe it was Eric Firing who mentioned that he raised this question
during development and Mark offered a potential solution. Whatever that
solution was, we should take a look at implementing it.

> 2) Performance: I can't accept more than, say, 2x overhead in floating
> point array operations (binary ops or reductions). Last time I checked
> we were a long way away from that.
>
>
Known problem, and probably fixable by pushing things down into the inner
ufunc loops. What we have at the moment is a prototype for testing the API
and that is what we need feedback on.
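
For anyone who wants to put a number on the overhead in point 2, a rough
timing harness might look like the sketch below. It uses the long-standing
numpy.ma masked arrays as a stand-in, which is an assumption on my part: the
NA-mask prototype is a different code path, so the ratio printed here says
nothing about the prototype itself. The array size and repeat count are
arbitrary.

```python
import timeit

import numpy as np

# Two plain float arrays and a shared mask hiding every tenth element.
rng = np.random.default_rng(0)
a = rng.random(100_000)
b = rng.random(100_000)
mask = np.zeros(a.shape, dtype=bool)
mask[::10] = True

# numpy.ma stand-ins for "arrays with missing values".
ma = np.ma.masked_array(a, mask=mask)
mb = np.ma.masked_array(b, mask=mask)

# Time the same binary op on plain and masked operands.
plain_time = timeit.timeit(lambda: a + b, number=200)
masked_time = timeit.timeit(lambda: ma + mb, number=200)

print(f"masked/plain time ratio: {masked_time / plain_time:.1f}x")
```

Swapping the lambdas for reductions (for example `a.sum()` against
`ma.sum()`) covers the other case Wes mentions.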

> 3) Implementation of NA-aware algorithms in Cython. A lot of pandas is
> about moving data around. Bit patterns would make life a lot easier
> because the code wouldn't have to change (much). But with masked
> arrays I'll have to move both data and mask values. Not the end of the
> world, but it is just the price you pay, I guess.
>
>
Agree that this is a problem, along with memory usage. One solution is to
have a way to translate to bit patterns for export/import. Note that in the
wild some data sets come with separate masks, sometimes several for
different conditions, so the current implementation would work better for
those. We need to support several options here.
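
To make the export/import idea concrete, here is a minimal sketch of a round
trip between a separate mask and a NaN bit pattern. It relies on the existing
numpy.ma module rather than the NA prototype, and the sample values are made
up for illustration.

```python
import numpy as np

# Data that arrives with a separate mask (True marks a missing entry).
data = np.array([1.0, 2.0, 3.0, 4.0])
missing = np.array([False, True, False, False])

# Mask-based representation: values and mask travel side by side.
masked = np.ma.masked_array(data, mask=missing)

# Export: fold the mask into the data by writing NaN into the masked slots.
exported = masked.filled(np.nan)

# Import: rebuild a mask from the NaN bit pattern.
imported = np.ma.masked_invalid(exported)

print(exported)
print(imported.mask)
```

The round trip is lossy in one direction: a genuine NaN in the data comes
back as masked, and several masks for different conditions cannot be folded
into a single bit pattern at all.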

> Things in R are a bit simpler re: bit patterns because there's only
> double, integer, string (character), and boolean dtypes, whereas NumPy
> has the whole C type hierarchy. So I can appreciate that doing bit
> patterns across all the dtypes would be really hard.
>
>
We could maybe limit it to float types, strings, and booleans, maybe dates
also. I think integers are problematic: for instance, a uint8 255 turns up
in 8-bit images and means saturated, not missing.
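
A small sketch of the uint8 collision, with made-up pixel values. The name
HYPOTHETICAL_NA is purely illustrative and is not part of any NumPy API.

```python
import numpy as np

# An 8-bit image row in which 255 means "saturated", not "missing".
pixels = np.array([12, 255, 40, 255, 7], dtype=np.uint8)

# Suppose the sample at index 2 was never recorded.
missing = np.array([False, False, True, False, False])

# Sentinel approach: reuse 255 as the NA marker (illustrative only).
HYPOTHETICAL_NA = 255
with_sentinel = pixels.copy()
with_sentinel[missing] = HYPOTHETICAL_NA

# The saturated pixels at indices 1 and 3 now look missing as well.
looks_missing = with_sentinel == HYPOTHETICAL_NA

# Separate-mask approach: all 256 values stay available for real data,
# so saturated and missing remain distinguishable.
saturated = (pixels == 255) & ~missing

print(looks_missing.sum(), saturated.sum(), missing.sum())
```

Floats avoid this because NaN is already outside the set of ordinary
values; a uint8 has no spare bit pattern to give away.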

> In any case, I recognize that the current implementation will be
> useful to a lot of people, but it may not meet my performance and
> usability requirements. As I said, the solution I've cooked up has
> worked well so far, and since it isn't a major pain point I may just
> adopt the "ain't broke, don't fix" attitude and focus my efforts on
> building new features. "Practicality beats purity", I suppose.
>
>
That's perfectly reasonable. It would still help if you gave examples of use
cases where the current API doesn't work for you. I don't see much
difference between code using NaNs and code using NA at the API level apart
from the maskna/skipna keywords.
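
As a point of comparison, this is roughly what the skip-versus-propagate
choice looks like today with NaNs and with numpy.ma. The sketch deliberately
avoids the maskna/skipna keywords, since those belong to the prototype under
discussion; it only uses functions that already exist in released NumPy.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# Propagate: a plain sum is poisoned by the NaN.
propagated = np.sum(values)

# Skip: nansum ignores the NaN entries.
skipped = np.nansum(values)

# Mask-based equivalent of skipping: mask the invalid entries, then reduce.
masked_total = np.ma.masked_invalid(values).sum()

print(propagated, skipped, masked_total)
```

In both styles the caller has to choose the behaviour explicitly, either by
picking a different function or by converting to a masked array first.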

Chuck