[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Tue Sep 10 21:32:12 EDT 2013

Brian et al,

Brian, I hope your move/travel/etc. was as pleasant as such things can be.

On Fri, Jul 12, 2013 at 9:21 AM, Brian Granger <ellisonbg at gmail.com> wrote:

> Gabriel,
> <snip>
>
> Great, let's talk in Sept. to figure out a time that would work.
>

I'm still quite interested in meeting with you guys. Somewhere near the end
of the month would be best for me, but I'm pretty flexible.

> <snip>
>
> > Branching/DAG notebooks allow a single document to encompass the
> > research you did, while providing easy access to various views
> > corresponding to the generation of intermediate, alternative, and
> > final results.
> >
> > These more complex notebooks allow the viewer to ask and answer
> > important questions such as "What else did (s)he try here?" and
> > potentially even "Why did (s)he choose this particular analysis
> > strategy?". These questions can be answered in the text or external
> > supplementary materials in a linear notebook, but this is a
> > significant barrier to reproducibility of the research process (as
> > opposed to the analysis results).
>
> I can see that, however, I think the pure alt cells lack a critical
> feature.  They treat all branches as being equally important.  In
> reality, the branch that is chosen as the "best" one will likely
> require further analysis and discussion that the other branches
> don't.  Putting the different branches side by side makes it a little
> like "choose your own adventure" - when in reality, the author of the
> research wants to steer the reader along a very particular path.  The
> alternative paths may be useful to have around, but they should not be
> given equal weight as the "best" one.  But, maybe it is just
> presentation and can be accounted for in descriptive text.
>

This is very true. My current thinking calls for both a "default"
designation and a "most recently selected/run" designation, which I believe
deals with the valid concern you raise above.

There are also other important designations for "branch types". The most
notable/easily explained of these is the concept of a "terminal" branch,
which is a branch that records important computations (and prose), and
which a viewer of the notebook  (be it the original author, a reviewer, a
student, or someone looking to extend the work) may want to look at or run,
but whose output is not compatible with the subsequent computations. This
arises most commonly when one analysis strategy is implemented and pursued,
but ultimately abandoned  (hopefully for good reasons, and with this we can
check!) in favor of a different final strategy which produces incompatible
output. The subsequent code then makes assumptions about the output which
are compatible with the final strategy computations, but not the original
strategy ones. A way to gracefully deal with this case is important for any
document/processing/rendering system attempting to pursue these concepts.
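
To make the designations concrete, here is a rough Python sketch of how
"default", "most recently selected" and "terminal" might be recorded on an
alt-set. Every name in it (Branch, AltSet, is_default, is_terminal,
last_selected, may_run_downstream) is hypothetical and chosen only for
illustration; it is not how my fork or IPython stores anything.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Branch:
    """One alternative inside an alt-set (hypothetical structure)."""
    title: str
    is_default: bool = False    # the branch the author steers readers to
    is_terminal: bool = False   # output is incompatible with later cells


@dataclass
class AltSet:
    branches: List[Branch] = field(default_factory=list)
    last_selected: Optional[str] = None  # "most recently selected/run"

    def active(self) -> Branch:
        """Most recently selected branch if there is one, else the default."""
        for branch in self.branches:
            if branch.title == self.last_selected:
                return branch
        for branch in self.branches:
            if branch.is_default:
                return branch
        raise ValueError("alt-set has no default branch")

    def may_run_downstream(self) -> bool:
        """Later cells are only safe if the active branch is not terminal."""
        return not self.active().is_terminal


outliers = AltSet([
    Branch("trim 5% by quantile", is_default=True),
    Branch("winsorize (abandoned)", is_terminal=True),
])
print(outliers.active().title, outliers.may_run_downstream())

# A reader switches to the abandoned strategy to inspect it.
outliers.last_selected = "winsorize (abandoned)"
print(outliers.active().title, outliers.may_run_downstream())
```

The point of the sketch is only that a renderer or executor can ask the
alt-set whether the cells after it are safe to run, instead of discovering
the incompatibility through a failure (or worse, a wrong answer) later on.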

There are other cases that arise with these documents, but I will omit a
detailed discussion of them and what I think should be done to support them
here, as that would make this mail burdensomely long and it is not my
primary message.

I will note, though, that while I agree that the final/core/whathaveyou and
secondary/informative/archival branches should not be indistinguishable, it
is important for my usecase that the secondary branches be easily
accessible whenever the reader wants them, in both interactive (notebook)
and headless (nbconvert) modes.

> <snip>
>
> > From a practical/UI standpoint unselected branches can be hidden
> > almost entirely (in theory, not currently in my PoC :p), resulting in
> > a view equivalent to the only view offered by a linear notebook. This
> > means that from a viewer (and author, since a straight line IS a DAG
> > and nesting isn't forced) standpoint, what I'm describing is in
> > essence a strict extension of what the notebook does now, rather than
> > a change.
>
> I would be *more* interested in alt-cell approaches that present the
> notebook as a linear entity in all cases, but that has the alt-cell
> logic underneath.  For example, what about the following:
>
> * A user writes the different N alt cells in linear sequence
> * The result is a purely linear notebook where one of the N cells
> should be run.
> * We write a JavaScript plugin for the notebook that does a couple of
> things:
>
> 1. It provides a cell toolbar for marking those cells as members of an
> alt-set.  This would simply modify the cell level metadata and allow
> the author to provide titles of each alt-member.
>

What about branching that is 2 or more levels deep? That happens naturally
with my approach but sounds difficult/annoying to keep track of in the one
you are describing.

> 2. It provides the logic for building a UI for viewing one of the
> alt-set members at a time.  It could be as simple as injecting a drop
> down menu that shows one and hides the rest.
>

I have an ugly but functional version of this now in my implementation.

>
> * This plugin could simply walk the notebook cells and find all the
> alt-cell sets and build this supplementary UI.
> * This plugin could also have settings that allow the author to select
> the "best" member of the alt-set.
> * nbconvert Transformers could use the cell level metadata to export
> the notebook in different formats.
>
> As I write about this - I think this would be extremely nice, and it
> would not be difficult to write at all.  Because of how our JavaScript
> plugins work, it could be developed outside IPython initially.  The
> question of inclusion in the official code base could be handled
> later.  Honestly, this approach should be much easier than the work
> you have already done.
>

Well, editing the notebook once it exists in this form seems like it would
be much less fun, in terms of adding new cells.

What you're describing is also much more onerous for the author. With what
I have now, you declare a cell to be an altset or task and everything just
sort of works. New cells are inserted in the right places, cells trivially
know who their parents are, etc.

If I understand you correctly, the author would have to write all the
alternatives in a big linear document (not fun or easy to test, see
discussion below) and then click a bunch of buttons to manually select what
cells go in which alternate. That is a much larger cognitive burden on the
author (as well as probably being really annoying...).
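
For concreteness, the linear-plus-metadata scheme proposed above could be
sketched roughly as follows. The real thing would be a JavaScript plugin;
this is only the grouping logic written in Python, and the metadata keys
"alt_set", "alt_title" and "alt_best" are made up for the example.

```python
from collections import OrderedDict

# A linear notebook reduced to the parts that matter here. The cell sources
# are plain strings and are never executed. The "alt_set", "alt_title" and
# "alt_best" metadata keys are hypothetical.
cells = [
    {"source": "data = load()", "metadata": {}},
    {"source": "fit = knn(data)",
     "metadata": {"alt_set": "model", "alt_title": "KNN"}},
    {"source": "fit = svm(data)",
     "metadata": {"alt_set": "model", "alt_title": "SVM", "alt_best": True}},
    {"source": "report(fit)", "metadata": {}},
]


def find_alt_sets(cells):
    """Map each alt-set name to the (index, title) pairs of its members."""
    alt_sets = OrderedDict()
    for index, cell in enumerate(cells):
        name = cell["metadata"].get("alt_set")
        if name is not None:
            title = cell["metadata"].get("alt_title", "cell %d" % index)
            alt_sets.setdefault(name, []).append((index, title))
    return alt_sets


def cells_to_run(cells):
    """Keep ordinary cells plus the member of each alt-set marked as best."""
    keep = []
    for cell in cells:
        meta = cell["metadata"]
        if "alt_set" not in meta or meta.get("alt_best", False):
            keep.append(cell["source"])
    return keep


print(find_alt_sets(cells))
print(cells_to_run(cells))
```

Note that each cell carries only a flat label, so a branch nested inside
another branch would need some extra convention (a path, a parent pointer)
that the author has to maintain by hand, and anything that ignores the
metadata simply sees four cells to run in order.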

>
> Best of all the resulting notebooks would remain standard linear
> notebooks that could be shared today on nbviewer, etc.  It would just
> work.
>

Respectfully, this is actually the fatal flaw of this approach IMO, both in
this case and in other cases where a JS plugin/extension uses the metadata
approach to fundamentally modify *behavior* (as opposed to aesthetics/UI)
of the IPython Notebook.

The issue, stated in the context of the nesting/alts/etc cells extension,
is that a notebook that has branching/alternates *requires* that they be
understood as such, rather than simply benefiting from it.

The ability to distribute notebooks I write and have them work properly is
entirely core to my usecase for IPython. If I can't do so, what I
personally can get IPython or IPython notebooks to do on my own machine is
not something I have any real interest in. Now you may be thinking to
yourself "But Gabe, no one is using your fork so you can't do that now with
your implementation anyway". That is true, but if someone without my fork
installed manages to get their hands on a notebook which uses the nesting
features, it will break when they try to load it.

If I create an extension as you are describing, create a complex notebook
using it, and someone without the plugin installed finds it, downloads it,
and runs it, it will *run fine and happily give them incorrect results
without even noticing the extra bits I stuck in the metadata*.
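
The alternative to failing silently is failing visibly. A minimal sketch of
that kind of check, assuming a hypothetical "required_extensions" entry in
the notebook-level metadata (as far as I know nothing like this key or this
check exists in IPython today):

```python
class MissingExtensionError(RuntimeError):
    """Raised when a notebook declares an extension that is not available."""


def check_required_extensions(notebook, installed):
    """Refuse to proceed if the notebook needs extensions we do not have.

    `notebook` is the parsed notebook JSON as a dict and `installed` is a
    set of extension names. The "required_extensions" key is hypothetical.
    """
    required = notebook.get("metadata", {}).get("required_extensions", [])
    missing = sorted(set(required) - set(installed))
    if missing:
        raise MissingExtensionError(
            "notebook requires extensions that are not installed: "
            + ", ".join(missing)
        )


nb = {"metadata": {"required_extensions": ["altcells"]}}
try:
    check_required_extensions(nb, installed=set())
except MissingExtensionError as err:
    print("refusing to run:", err)
```

A stock installation that has never heard of the key would of course skip
the check entirely, which is exactly the problem.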

The core issue here is that running a notebook with branching as a linear
notebook by executing each of the branches in sequence is actually
erroneous and will produce undefined, untrustworthy, and likely incorrect,
behavior and output. The reason for this is that branches/alternatives are
assumed to be mutually exclusive by the computational model, and can alter
objects in-place in manners that can have unintended cumulative effects.

As a very simple example, consider branches which handle outliers in a
certain variable by modifying the variable in-place and trimming its
values by .1, 1, 5, and 10%, respectively, using quantiles, and then
consider what would happen if these branches were all run in an arbitrary
order.

It is easy to see that the outcome from running all the branches (which is
what will silently happen if the notebook is treated as a standard linear
notebook because the plugin is not being used) does not reflect any of the
choices intended by the author and more complex situations could be
difficult to predict at all without sitting down and thinking about it.
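
The trimming example can be sketched in a few lines of plain Python. The
function trim_in_place is a stand-in I made up for illustration, and it
assumes "trimming by p%" means dropping p% of the points, split evenly
between the two tails.

```python
def trim_in_place(values, fraction):
    """Drop `fraction` of the points, split evenly between the two tails.

    Mutates `values`, as the branches in the example are assumed to do.
    """
    k = round(len(values) * fraction / 2)
    values[:] = sorted(values)[k:len(values) - k]


fractions = [0.001, 0.01, 0.05, 0.10]

# What the author intended: exactly one branch is applied to the raw data.
single = {}
for f in fractions:
    data = list(range(10000))
    trim_in_place(data, f)
    single[f] = len(data)

# What a linear run does: every branch is applied to the same object.
data = list(range(10000))
for f in fractions:
    trim_in_place(data, f)

print("one branch at a time:", single)
print("all branches in sequence:", len(data))
```

The sequential run leaves fewer points than even the most aggressive single
branch, so the data reaching the downstream cells matches none of the four
analyses the author actually wrote.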

As such, I would not be comfortable distributing branching notebooks using
the extension mechanism as I understand it to exist now because a) I feel
it indirectly damages the type of scientific reproducibility and result
trustworthiness I seek to advance, and b) I don't want to spend all my time
fielding angry emails/bugreports from notebook authors who sent their
notebooks to collaborators who didn't have the plugin installed.

>
> <snip>
>
> > Consider the example of classifying new data based on a training set
> > via KNN, SVM, and GLM approaches. These approaches all need different
> > sets of parameters, return different types of objects as the output of
> > the fitting function, may have subtly different behaviour when being
> > used for prediction, etc.
>
> Yep, that is the big challenge with the branching idea in general.  It
> is not always true that the members of the alt sets can be swapped
> out.
>

And under the model I am envisioning, that is actually an informative and
queryable feature, rather than a drawback. See my discussion above
regarding terminal branches.

>
> <snip>
>
> I hope you can see that I really like the general idea and think the
> usage cases you are describing are really important.  I think I can
> speak for the project in saying that we want the notebook to be useful
> for things like this.  But I think our abstractions are important
> enough that we make every attempt to see how we can do these while
> leveraging our existing abstractions.  This is partially a question
> about implementation, but also partly a question about how the new
> features are thought about.  The reason we don't like to break
> abstractions for new features is that we have found an interesting
> relationship between abstraction breaking and new features.  We have
> found that when a new feature/idea breaks a core abstraction that we
> have thought about very carefully, it is usually because the feature
> has not been fully understood.  Time and time again, we have found
> that when we take the time to fully understand the feature, it usually
> fits within our abstractions beautifully and is even much better than
> we ever imagined it could be.
>
> The plugin idea above is a perfect example of this.  By preserving the
> abstractions the new feature itself becomes a multiplier of even more
> new functionality:
>
> * The resulting notebooks can still be version controlled.  This means
> that the different alt-cells can be thrown into git and when we develop
> a visual diff tool for notebooks, they will *just work*.
>

I don't really understand this point. I have numerous fork-based non-linear
notebooks under version control.

Also, when you have a visual diff tool, it will successfully do
*something* when given a linear+metadata branching notebook, but whether
that something would be to deliver the information required to understand
changes to non-linear notebooks is less clear (and seems somewhat
unlikely).

> * The notebooks can immediately leverage the abstractions we have put
> into place for converting notebooks to different formats.  You could
> write custom transformers to present the notebook as a reveal.js
> presentation, giving alt-cells special treatment.
>

I could write custom transformers, this is true, but the default behavior
would treat the notebook as if it actually were linear (instead of just
being stored that way) which is problematic.

> * All of this can be done, and put into the hands of users, without
> going through those overly conservative IPython developers ;-)
> * It will just work with nbviewer as well.
>

Again, I disagree. It would *display* in nbviewer, but not work, in that
the display would be actively misleading regarding what the notebook would
do when executed properly.

> * It provides a cleanly abstracted foundation for other people to
> build upon
>

I agree that this is important, but it is not clear to me that it would be
more true in the case that I created the extension via custom JS than it
would if nesting were supported in the actual ipynb format and core
notebook mechanisms.

>
> In summary, we are trying to build an architecture that allows a few
> simple abstractions (we actually don't have that many!) to combine in
> boundless ways to create features we never planned on, but that "just
> work".
>

I agree that the customjs + metadata extensions approach is very powerful
and almost infinitely versatile. I think it is great for extensions which
change appearance/rendering/UI details of how the notebook behaves.

As far as I can see, however, it has some significant problems with regard
to extensions which fundamentally change non-rendering behavior of
notebooks (please correct me if I'm wrong), namely:

   - There is no guarantee that notebooks authored using an extension which
   alters fundamental behaviors will work or visibly fail in the absence of
   that extension
   - There is no way for an individual notebook to require a particular
   extension
   - There is no way to ensure that two extensions are compatible with
   each-other
   - There is no standard/unified way for end-users to install extensions
   - There is no way for users to determine which extensions they have

The first point is not true of extensions which exclusively affect
rendering and UI, making the rest of the points minor nuisances rather than
critical issues.

Looking forward to hearing your (further) thoughts about this stuff and
hopefully meeting you in person soon.

~G

-- 
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis