[Numpy-discussion] consensus (was: NA masks in the next numpy release?)

Fri Oct 28 17:16:11 EDT 2011

On Tue, Oct 25, 2011 at 2:56 PM, Travis Oliphant <oliphant at enthought.com> wrote:
> I think Nathaniel and Matthew provided very
> specific feedback that was helpful in understanding other perspectives of a
> difficult problem. In particular, I really wanted bit-patterns
> implemented. However, I also understand that Mark did quite a bit of work
> and altered his original designs quite a bit in response to community
> feedback. I wasn't a major part of the pull request discussion, nor did I
> merge the changes, but I support Charles if he reviewed the code and felt
> like it was the right thing to do.  I likely would have done the same thing
> rather than let Mark Wiebe's work languish.

My connectivity is spotty this week, so I'll stay out of the technical
discussion for now, but I want to share a story.

Maybe a year ago now, Jonathan Taylor and I were debating what the
best API for describing statistical models would be -- whether we
wanted something like R's "formulas" (which I supported), or another
approach based on sympy (his idea). To summarize, I thought his API
was confusing, pointlessly complicated, and didn't actually solve the
problem; he thought R-style formulas were superficially simpler but
hopelessly confused and inconsistent underneath. Now, obviously, I was
right and he was wrong. Well, obvious to me, anyway... ;-) But it
wasn't like I could just wave a wand and make his arguments go away,
no matter how annoying and wrong-headed I thought they were... I could
write all the code I wanted but no-one would use it unless I could
convince them it's actually the right solution, so I had to engage
with him, and dig deep into his arguments.

What I discovered was that (as I thought) R-style formulas *do* have a
solid theoretical basis -- but (as he thought) all the existing
implementations *are* broken and inconsistent! I'm still not sure I
can actually convince Jonathan to go my way, but, because of his
stubbornness, I had to invent a better way of handling these formulas,
and so my library[1] is actually the first implementation of these
things that has a rigorous theory behind it, and in the process it
avoids two fundamental, decades-old bugs in R. (And I'm not sure the R
folks can fix either of them at this point without breaking a ton of
code, since they both have API consequences.)

--

It's extremely common for healthy FOSS projects to insist on consensus
for almost all decisions, where consensus means something like "every
interested party has a veto"[2]. This seems counterintuitive, because
if everyone's vetoing all the time, how does anything get done? The
trick is that if anyone *can* veto, then vetoes turn out to actually
be very rare. Everyone knows that they can't just ignore alternative
points of view -- they have to engage with them if they want to get
anything done. So you get buy-in on features early, and no vetoes are
necessary. And by forcing people to engage with each other, like me
with Jonathan, you get better designs.

But what about the cost of all that code that doesn't get merged, or
written, because everyone's spending all this time debating instead?
Better designs are nice and all, but how does that justify letting
working code languish?

The greatest risk for a FOSS project is that people will ignore you.
Projects and features live and die by community buy-in. Consider the
"NA mask" feature right now. It works (at least the parts of it that
are implemented). It's in mainline. But IIRC, Pierre said last time
that he doesn't think the current design will help him improve or
replace numpy.ma. Up-thread, Wes McKinney is leaning towards ignoring
this feature in favor of his library pandas' current hacky NA support.
Members of the neuroimaging crowd are saying that the memory overhead
is too high and the benefits too marginal, so they'll stick with NaNs.
Together these folks make up a huge proportion of this feature's target
audience. So what have we actually accomplished by merging this to
mainline? Are we going to be stuck supporting a feature that only a
fraction of the target audience actually uses? (Maybe they're being
dumb, but if people are ignoring your code for dumb reasons... they're
still ignoring your code.)

The consensus rule forces everyone to do the hardest and riskiest part
-- building buy-in -- up front. Because you *have* to do it sooner or
later, and doing it sooner doesn't just generate better designs. It
drastically reduces the risk of ending up in a huge trainwreck.

--

In my story at the beginning, I wished I had a magic wand to skip this
annoying debate and political stuff. But giving it to me would have
been a bad idea. I think that's what went wrong with the NA discussion in
the first place. Mark's an excellent programmer, and he tried his best
to act for the good of everyone in the project -- but in the end, he
did have a wand like that. He didn't have that sense that he *had* to
get everyone on board (even the people who were saying dumb things),
or he'd just be wasting his time. He didn't ask Pierre if the NA
design would actually work for numpy.ma's purposes -- I did.

You may have noticed that I do have some ideas about how NA
support should work. But my ideas aren't really the important thing.
The alter-NEP was my attempt to find common ground between the
different needs people were bringing up, so we could discuss whether
it would work for people or not. I'm not wedded to anything in it. But
this is a complicated issue with a lot of conflicting interests, and
we need to find something that actually does work for everyone (or as
large a subset as is practical).

So here's what I think we should do:
  1) I will submit a pull request backing Mark's NA work out of
mainline, for now. (This is more or less done, I just need to get it
onto github, see above re: connectivity)
  2) I will also put together a new branch containing that work,
rebased against current mainline, so it doesn't get lost. (Ditto.)
  3) And we'll decide what to do with it *after* we hammer out a
design that the various NA-supporting groups all find convincing. Or
we at least settle a design for some of the less controversial pieces
(like the 'where=' ufunc argument?), get those merged, and then iterate
incrementally.

What do you all think?

And in any case, thanks for reading,
-- Nathaniel

[1] https://github.com/charlton/charlton
[2] For example, this is written into the Apache voting procedure:
https://www.apache.org/foundation/voting.html (it's the "code
modification" rules that are relevant). And as usual, Karl Fogel has
more useful discussion:
http://producingoss.com/en/consensus-democracy.html (see esp. the
"When to vote" section, which is entirely about how to avoid voting)