[IPython-dev] Some new cell types for describing data analyses in IPy. Notebook

Brian Granger ellisonbg at gmail.com
Sat Jul 13 19:40:27 EDT 2013


> I will think about the rest of what you said and suggested and we can take
> it back up when you get back.

Great.

> Have a good trip/move

Thanks,

Brian

> ~G
>
>>
>>
>> > Note that sometimes the viewer is you coming back to an analysis after
>> > some
>> > span of time so that the reasoning behind your decisions is no longer
>> > fresh.
>>
>> Yes, this is an extremely common - if not the most common - usage case of
>> all...
>>
>> > As a practical/UI standpoint unselected branches can be hidden almost
>> > entirely (in theory, not currently in my PoC :p), resulting in a view
>> > equivalent to (any) the only view offered by a linear notebook. This
>> > means
>> > that from a viewer (and author since a straight line IS a DAG and
>> > nesting
>> > isn't forced) standpoint, what I'm describing is in essence a strict
>> > extension of what the notebook does now, rather than a change.
>>
>> I would be *more* interested in alt-cell approaches that present the
>> notebook as a linear entity in all cases, but that has the alt-cell
>> logic underneath.  For example, what about the following:
>>
>> * A user writes the different N alt cells in linear sequence
>> * The result is a purely linear notebook where one of the N cells should
>> be run.
>> * We write a JavaScript plugin for the notebook that does a couple of
>> things:
>>
>> 1. It provides a cell toolbar for marking those cells as members of an
>> alt-set.  This would simply modify the cell level metadata and allow
>> the author to provide titles of each alt-member.
>> 2. It provides the logic for building a UI for viewing one of the
>> alt-set members at a time.  It could be as simple as injecting a drop
>> down menu that shows one and hides the rest.
>>
>> * This plugin could simply walk the notebook cells and find all the
>> alt-cell sets and build this supplementary UI.
>> * This plugin could also have settings that allow the author to select
>> the "best" member of the alt-set.
>> * nbconvert Transformers could use the cell level metadata to export
>> the notebook in different formats.
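
The metadata-driven approach in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the notebook is treated as a plain dict with a flat "cells" list, and the metadata keys "alt_set", "alt_title" and "alt_best" are hypothetical names invented here, not part of any existing notebook format or plugin.

```python
from collections import OrderedDict


def find_alt_sets(notebook):
    """Group cell indices by the hypothetical 'alt_set' metadata key."""
    alt_sets = OrderedDict()
    for index, cell in enumerate(notebook["cells"]):
        name = cell.get("metadata", {}).get("alt_set")
        if name is not None:
            alt_sets.setdefault(name, []).append(index)
    return alt_sets


def select_best(notebook):
    """Return the cells of a linear view: plain cells plus the 'best' alt member."""
    kept = []
    for cell in notebook["cells"]:
        meta = cell.get("metadata", {})
        # Drop alt-set members that the author has not marked as best.
        if "alt_set" in meta and not meta.get("alt_best", False):
            continue
        kept.append(cell)
    return kept


# A purely linear notebook in which two adjacent cells form one alt-set.
notebook = {"cells": [
    {"cell_type": "code", "source": "data = load()", "metadata": {}},
    {"cell_type": "code", "source": "fit_knn(data)",
     "metadata": {"alt_set": "classifier", "alt_title": "KNN"}},
    {"cell_type": "code", "source": "fit_svm(data)",
     "metadata": {"alt_set": "classifier", "alt_title": "SVM", "alt_best": True}},
    {"cell_type": "code", "source": "report()", "metadata": {}},
]}

alt_sets = find_alt_sets(notebook)
best_view = [cell["source"] for cell in select_best(notebook)]
```

A UI plugin would use something like find_alt_sets to build its drop-down menus, while an export step could use something like select_best to emit only the chosen branch. The notebook itself stays linear in both cases.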
>>
>> As I write about this - I think this would be extremely nice, and it
>> would not be difficult to write at all.  Because of how our JavaScript
>> plugins work, it could be developed outside IPython initially.  The
>> question of inclusion in the official code base could be handled
>> later.  Honestly, this approach should be much easier than the work
>> you have already done.
>>
>> Best of all the resulting notebooks would remain standard linear
>> notebooks that could be shared today on nbviewer, etc.  It would just
>> work.
>>
>> Are you interested in taking a shot at this?  I think that would be
>> awesome.
>>
>> >>
>> >> BUT, I completely agree that the notebook does not handle certain
>> >> types of branching very well.  Where the notebook starts to really
>> >> suck is for longer analyses that you want to repeat for differing
>> >> parameters or algorithms.  You talk more about this usage case below
>> >> and we have started to think about how we would handle this.  Here are
>> >> our current thoughts:
>> >>
>> >> It would be nice to write a long notebook and then add metadata to the
>> >> notebook that indicates that some variables are to be treated as
>> >> "templated" variables.  Then we would create tools that would enable a
>> >> user to run a notebook over a range of templates:
>> >>
>> >> for x in xvars:
>> >>   for y in yvars:
>> >>     for algo in myalgos:
>> >>       run_notebook('MyCoolCode', x, y, algo)
>> >>
>> >> The result would be **something** that allows the user to explore the
>> >> parameter space represented.  A single notebook would be used as the
>> >> "source" for this analysis and the result would be the set of all
>> >> paths through the notebook.  We have even thought about using our
>> >> soon-to-be-designed interactive widget architecture to enable the
>> >> results to be explored using different UI controls (sliders, etc) for
>> >> the xvar, yvar, algos.  This way you could somehow "load" the
>> >> resulting analysis into another notebook and explore things
>> >> interactively - with all of the computations already done.
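
A minimal sketch of the batch run described above, assuming a hypothetical run_notebook function. Here it is only a stand-in that records its arguments; no such function exists in IPython as far as this thread states. The nested loops collapse into itertools.product, and results are keyed by parameter tuple so they can be looked up later without recomputation.

```python
from itertools import product


def run_notebook(name, x, y, algo):
    """Hypothetical stand-in: a real tool would execute the notebook here."""
    return {"notebook": name, "x": x, "y": y, "algo": algo}


def sweep(name, xvars, yvars, algos):
    """Run the notebook once per parameter combination and collect the results."""
    results = {}
    for x, y, algo in product(xvars, yvars, algos):
        results[(x, y, algo)] = run_notebook(name, x, y, algo)
    return results


results = sweep("MyCoolCode", [1, 2], [10, 20], ["knn", "svm"])
```

Keying the results by (x, y, algo) is one possible design for the "load and explore" step: a slider or menu only has to build the tuple and index into the precomputed dictionary.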
>> >>
>> >
>> > This is a very powerful and exciting use-case. In fact it is one I am
>> > investigating myself in the context of a different project unrelated to
>> > IPython notebook. I call the set of results generated by such repeated
>> > runs
>> > with different input sets (ie paths through the document) the
>> > "robustness
>> > set" of the notebook with respect to the particular output variable
>> > being
>> > investigated.
>>
>> Yes, this is a sort of batch mode for the notebook.
>>
>> > The key here is that the robustness we are talking about is not only
>> > with
>> > respect to data/tuning parameters, but also with respect to the
>> > decisions/choices made during the analysis process itself. These
>> > decisions
>> > are often the difference between valid and invalid conclusions, but are
>> > rarely talked about during discussions about reproducible
>> > research/science
>> > AFAIK (I'd love to be wrong about that, even if it would make me look
>> > silly/foolish here).
>> >
>> > The DAG conceptual model buys us a lot here too though. Instead of
>> > having to
>> > run the entire notebook, you can calculate all possible paths through
>> > the
>> > DAG for any arbitrary (connected) starting and ending points. So we can
>> > rerun only pieces of large notebooks to investigate any variable/plot
>> > regardless of whether it constitutes a final result of the
>> > notebook/analysis.
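
The path enumeration described here can be illustrated with a short recursive sketch. The DAG is assumed to be given as a dict mapping each node to its successors, and the node names are invented for the example. This is not code from the proof of concept.

```python
def all_paths(dag, start, end):
    """Yield every path from start to end in a DAG given as {node: [successors]}."""
    if start == end:
        yield [start]
        return
    for successor in dag.get(start, []):
        for tail in all_paths(dag, successor, end):
            yield [start] + tail


# Two alternative cleaning steps feed the same fit, which feeds a plot.
dag = {
    "load": ["clean_a", "clean_b"],
    "clean_a": ["fit"],
    "clean_b": ["fit"],
    "fit": ["plot"],
}

# The end point need not be a final result of the notebook.
paths = list(all_paths(dag, "load", "fit"))
```

Because the graph is acyclic the recursion always terminates, and stopping at an intermediate node such as "fit" means nothing downstream of it has to be rerun.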
>>
>> Yes, this type of analysis could also be done by the JavaScript plugin
>> approach above.
>>
>> >
>> >>
>> >> We have other people interested in this type of workflow and it can
>> >> all be done within the context of our existing linear notebook model.
>> >> It is just assembling the existing abstractions in different ways.
>> >>
>> >
>> > That is a plus. There is what I consider to be a pretty major drawback
>> > to
>> > this approach though.
>> >
>> > It is easy to see how this would work in the case of variables
>> > representing
>> > individual number/string/boolean valued parameters without much
>> > perturbation
>> > of the code.
>> >
>> > Trying to write an analysis script that can graciously handle
>> > substantially
>> > dissimilar analysis methods, on the other hand, is more problematic. We
>> > can
>> > do it, of course, but at that point we are moving much more into the
>> > realm
>> > of a program rather than an analysis script.
>>
>> Yes, definitely.
>>
>> > Consider the example of classifying new data based on a training set via
>> > KNN, SVM, and GLM approaches. These approaches all need different sets
>> > of
>> > parameters, return different types of objects as the output of the
>> > fitting
>> > function, may have subtly different behaviour when being used for
>> > prediction, etc.
>>
>> Yep, that is the big challenge with the branching idea in general.  It
>> is not always true that the members of the alt sets can be swapped
>> out.
>>
>> > The abstractions necessary to deal with these differences are likely in
>> > my
>> > opinion to be highly costly in terms of how easy it is for readers of
>> > the
>> > notebook to follow and understand what the code is doing.
>> >
>> > With actual branching, the code in each branch is exactly the same as if
>> > it
>> > were in a normal linear notebook which implemented only that one branch,
>> > making it much more likely to be straightforward and easy to read.
>>
>> But I think the same issue exists with any approach to branching
>> right?  I am thinking the scripted notebook could have the same type
>> of API - the important point is that the templated variables, while
>> simple types, could trigger different code paths.
>>
>> algo = 0 # a template variable
>>
>> if algo == 0:
>>   pass  # alt-cell #1
>> elif algo == 1:
>>   pass  # alt-cell #2
>> ...
>>
>> This is not pretty but it would work...
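
One way to tidy the if/elif chain is a dispatch table that maps the template variable to a function per branch. This is a sketch of an alternative, not something proposed in the thread, and the branch functions and their return values are placeholders invented for illustration.

```python
def branch_knn(data):
    """Placeholder for the code of alt-cell #1."""
    return "knn on %d points" % len(data)


def branch_svm(data):
    """Placeholder for the code of alt-cell #2."""
    return "svm on %d points" % len(data)


# The simple-typed template variable selects which branch runs.
BRANCHES = {0: branch_knn, 1: branch_svm}


def run_branch(algo, data):
    try:
        branch = BRANCHES[algo]
    except KeyError:
        raise ValueError("unknown algo: %r" % (algo,))
    return branch(data)


outcome = run_branch(0, [1, 2, 3])
```

Each branch body stays as readable as it would be in a notebook implementing only that branch, and an unknown template value fails loudly instead of silently running nothing.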
>>
>> > One of my targeted use-cases is publications which can more accurately
>> > convey the research which was done while still able to offer the clarity
>> > of
>> > focus of what we do now, so I think that is quite important. YMMV.
>> >
>> > And now the sticking point.
>> >>
>> >>
>> >> <snip>
>> >>
>> >> Q: does the new feature violate important abstractions we have in
>> >> place.
>> >>
>> >> If the answer is no, then we do our normal job of considering the
>> >> costs of adding the feature versus the benefits.
>> >>
>> >> If the answer is yes, then we *stop*.
>> >
>> >
>> > I really do appreciate the IPython team's position. I think there is
>> > some
>> > relevant nuance involved in this particular case, however, which makes
>> > the
>> > does it change? yes:no test overly coarse. I attempt to make my case for
>> > this below.
>> >
>> > I think the answer to the questions "does this new feature violate
>> > important
>> > abstractions?" and "is it impossible/burdensomely difficult to alter
>> > important existing abstractions in a way that supports this feature
>> > without
>> > affecting the uses of the abstraction?" , may be different here, despite
>> > being the same in the overwhelming majority of cases.  And I would argue
>> > the
>> > second test offers identical protections as the first against the
>> > various
>> > pitfalls of making major changes to large projects willy-nilly (which
>> > I
>> > assure you I do understand are very real).
>> >
>> > I'm not advocating a dramatic about-face on the issue complete with
>> > parade
>> > and skywriting that "IPython is pursuing an exciting new thing starting
>> > today!". I do, however, think it is perhaps worth consideration at a
>> > somewhat narrower and more immediate scale than it would be otherwise.
>>
>> I hope you can see that I really like the general idea and think the
>> usage cases you are describing are really important.  I think I can
>> speak for the project in saying that we want the notebook to be useful
>> for things like this.  But I think our abstractions are important
>> enough that we make every attempt to see how we can do these while
>> leveraging our existing abstractions.  This is partially a question
>> about implementation, but also partly a question about how the new
>> features are thought about.  The reason we don't like to break
>> abstractions for new features is that we have found an interesting
>> relationship between abstraction breaking and new features.  We have
>> found that when a new feature/idea breaks a core abstraction that we
>> have thought about very carefully, it is usually because the feature
>> has not been fully understood.  Time and time again, we have found
>> that when we take the time to fully understand the feature, it usually
>> fits within our abstractions beautifully and is even much better than
>> we ever imagined it could be.
>>
>> The plugin idea above is a perfect example of this.  By preserving the
>> abstractions, the new feature itself becomes a multiplier of even more
>> new functionality:
>>
>> * The resulting notebooks can still be version controlled.  This means
>> that the different alt-cells can be thrown into git and when we develop
>> a visual diff tool for notebooks, they will *just work*.
>> * The notebooks can immediately leverage the abstractions we have put
>> into place for converting notebooks to different formats.  You could
>> write custom transformers to present the notebook as a reveal.js
>> presentation, giving alt-cells special treatment.
>> * All of this can be done, and put into the hands of users, without going
>> through those overly conservative IPython developers ;-)
>> * It will just work with nbviewer as well.
>> * It provides a cleanly abstracted foundation for other people to build
>> upon
>>
>> In summary, we are trying to build an architecture that allows a few
>> simple abstractions (we actually don't have that many!) to combine in
>> boundless ways to create features we never planned on, but that "just
>> work".
>>
>> The upside of this is that when we have encountered features that are
>> important to us that really do require us to break or re-vision core
>> abstractions - we gladly undertake this work.  Mainly because we feel
>> that the new abstractions will be even more powerful.
>>
>> >>
>> >> <snip>
>> >>
>> >>
>> >> Thinking about your proposed feature from this perspective: both the
>> >> task cells and alt cells introduce hierarchy and nesting into the
>> >> notebook.  This breaks our core abstraction that cells are not nested.
>> >>  In Jan-Feb our core development team had a discussion about this
>> >> abstraction exactly.  We decided that we definitely don't want to move
>> >> in the direction of allowing nesting in the notebook.  Because of this
>> >> we are in the process of removing the 1 level of nesting our notebook
>> >> format currently has, namely worksheets.  So for us, it is not just
>> >> about complexity - it is about breaking the abstractions.
>> >
>> >
>> > I do understand this position. I'd like to think I am bringing up points
>> > not
>> > raised during that meeting, but whether or not that is the case
>> > abstractions
>> > ARE important.
>> >
>> > I guess I am/was thinking about the abstraction in place in IPython
>> > notebook
>> > a bit differently than you are describing it.
>> >
>> > For the next few paragraphs: process == (render|transform|execute|*)
>> >
>> > In my mind the abstraction/computational model is that a notebook is an
>> > ordered set of cells and to process the notebook you simply go through
>> > the
>> > cells in order and process them. What process means is dependent on the
>> > type
>> > of cell, and there are various pieces of code in various places (mostly
>> > the
>> > frontends and nbconvert AFAIK) that know how to handle each cell type.
>> >
>> > Under this formulation the change in abstraction is actually pretty
>> > small.
>> > The only addition is the statement that code which processes cells is
>> > responsible for initiating/handling the processing of any child cells
>> > those
>> > cells contain. The easiest example of this is the execute method on
>> > my
>> > task cells, which simply loops through each of its children and (if
>> > applicable) calls their execute method.
>> >
>> > With this change we still have a notebook defined as an ordered set of
>> > (top
>> > level) cells, and we can still process a notebook by stepping through
>> > each
>> > of them in order and processing that cell.
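
The computational model described in these paragraphs can be sketched as follows. The class names are hypothetical and this is not the code from the experimental fork. The point is only that the notebook-level loop is unchanged, while a container cell takes responsibility for its own children.

```python
class CodeCell:
    def __init__(self, source):
        self.source = source

    def execute(self, log):
        # A real cell would send its source to the kernel; here we only record it.
        log.append(self.source)


class MarkdownCell:
    """Has no execute method, so it is skipped during execution."""

    def __init__(self, text):
        self.text = text


def execute_if_applicable(cell, log):
    execute = getattr(cell, "execute", None)
    if callable(execute):
        execute(log)


class TaskCell:
    """A container cell that handles the processing of its child cells."""

    def __init__(self, children):
        self.children = children

    def execute(self, log):
        for child in self.children:
            execute_if_applicable(child, log)


def execute_notebook(cells):
    """Step through the top-level cells in order, exactly as in a flat notebook."""
    log = []
    for cell in cells:
        execute_if_applicable(cell, log)
    return log


notebook = [
    CodeCell("a = 1"),
    TaskCell([CodeCell("b = 2"), MarkdownCell("note"), TaskCell([CodeCell("c = 3")])]),
    CodeCell("d = 4"),
]
log = execute_notebook(notebook)
```

Note that execute_notebook never inspects nesting itself. It sees an ordered list of top-level cells, which is the sense in which the change to the abstraction is claimed to be small.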
>> >
>> > Some changes to the concept of next/previous cells and cell position
>> > (for
>> > positional insertion, etc) were required and cells must be aware of
>> > their
>> > direct parent (which will be either a cell or the notebook itself), but
>> > I
>> > would argue these aren't actually important attributes of the
>> > abstraction
>> > itself and the changes were actually fairly narrow and (AFAICS) pretty
>> > painless and straightforward after some careful thought/planning.
>>
>> This is an interesting way of thinking about nesting that I had not
>> thought about.
>>
>> >
>> >>
>> >> The reason that these abstractions are so important is that they
>> >> provide powerful foundations for us to build on.  One place the
>> >> "notebook as a linear sequence of cells" abstraction comes into play is
>> >> in our work on nbconvert that will appear in 1.0 in the next few
>> >> weeks.  This allows us to convert notebooks very easily to a number of
>> >> different formats.
>> >
>> >
>> > I haven't tackled nbconvert yet on my experimental fork, but I fully
>> > intend
>> > to as I agree entirely that the ability to generate things like linear
>> > pdfs
>> > and other static views is utterly crucial. The fact that a notebook with
>> > branches can generate a pdf that looks like it came from a linear
>> > notebook
>> > (ie the "static article" view) is a *major* selling point/core feature
>> > of
>> > what I'm trying to do with branching notebooks. It is key that people be
>> > able to meet the needs they are meeting now; if we can't, meeting the
>> > more
>> > nebulous needs they aren't isn't likely to save us (me) from
>> > irrelevance.
>> >
>> > Under my alternate description of the computational model described
>> > above
>> > nbconvert will behave pretty much as it always has: step through the
>> > notebook and process the cells in order into whatever format you are
>> > targeting. The one exception is the cells processing their children, but
>> > the
>> > scale of this change is not particularly large for the specific types of
>> > nesting I'm going for.
>> >
>> > Tasks would likely simply render their children without making any mark
>> > themselves in most cases, while altsets would do the same, but only for
>> > the
>> > "active" branch. This involves a bit of looping and a bunch of calls to
>> > existing code that knows how to transform the existing (core) cell
>> > types,
>> > but really not much else.
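
As a rough illustration of the rendering rule described above, the sketch below flattens a nested cell structure into the linear list a converter would then process. The dict layout and the keys "children", "branches" and "active" are assumptions made for this example only. They are not nbconvert API or the fork's actual format.

```python
def flatten(cells):
    """Return the linear 'static article' view of a list of possibly nested cells."""
    linear = []
    for cell in cells:
        kind = cell["cell_type"]
        if kind == "task":
            # Tasks make no mark of their own; only their children appear.
            linear.extend(flatten(cell["children"]))
        elif kind == "altset":
            # Only the active branch is rendered; the others are left out.
            linear.extend(flatten(cell["branches"][cell["active"]]))
        else:
            linear.append(cell)
    return linear


nested = [
    {"cell_type": "markdown", "source": "# Analysis"},
    {"cell_type": "task", "children": [
        {"cell_type": "code", "source": "data = load()"},
        {"cell_type": "altset", "active": 1, "branches": [
            [{"cell_type": "code", "source": "fit_knn(data)"}],
            [{"cell_type": "code", "source": "fit_svm(data)"}],
        ]},
    ]},
    {"cell_type": "code", "source": "report()"},
]

linear_sources = [cell["source"] for cell in flatten(nested)]
```

Everything after the flattening step can reuse existing code that already knows how to transform the core cell types.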
>> >
>> >
>> >
>> >>
>> >>  The other place this abstraction comes into play
>> >> is in our keyboard shortcuts.  We are striving for the notebook to be
>> >> usable for people who don't touch the mouse (your traditional vi/emacs
>> >> users).  Nesting makes that very difficult.
>> >
>> >
>> > I admit this one is tougher, though I've done some small amount of thinking
>> > about it (currently hitting "down" on a container cell enters it while
>> > hitting down on the last cell in a container navigates to the cell
>> > "after"
>> > the container in my PoC).
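
The "down" key behaviour described in the parenthetical amounts to a pre-order successor in the cell tree. A minimal sketch, assuming each cell knows its parent and that the notebook itself is the root node. The Node class and the cell names are hypothetical and not taken from the PoC.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def cell_below(cell):
    """Return the cell reached by pressing 'down', or None at the end."""
    # Pressing down on a non-empty container enters it.
    if cell.children:
        return cell.children[0]
    # Otherwise move to the next sibling, climbing out of finished containers.
    while cell.parent is not None:
        siblings = cell.parent.children
        position = siblings.index(cell)
        if position + 1 < len(siblings):
            return siblings[position + 1]
        cell = cell.parent
    return None


a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
task = Node("task", [b, c])
root = Node("notebook", [a, task, d])

# Walk the whole notebook from the first cell using only the down key.
visited = []
current = a
while current is not None:
    visited.append(current.name)
    current = cell_below(current)
```

With this rule a keyboard-only user reaches every cell in document order, and a notebook with no containers behaves exactly like the flat case.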
>> >
>> > I think this is surmountable though, and worth the effort if it were the
>> > only thing holding IPython notebook back from offering a
>> > change/alternative/"fix" to how we talk about research and what we can
>> > do
>> > with the documents we use to describe it.
>> >
>> > Wow that was a lot of text. Thanks for making it all the way to the end!
>>
>> I made it!
>>
>> Cheers,
>>
>> Brian
>>
>> > ~G
>> >
>> >
>> >> <snip>
>> >>
>> >
>> >
>> > --
>> > Gabriel Becker
>> > Graduate Student
>> > Statistics Department
>> > University of California, Davis
>> >
>> > _______________________________________________
>> > IPython-dev mailing list
>> > IPython-dev at scipy.org
>> > http://mail.scipy.org/mailman/listinfo/ipython-dev
>> >
>>
>>
>>
>> --
>> Brian E. Granger
>> Cal Poly State University, San Luis Obispo
>> bgranger at calpoly.edu and ellisonbg at gmail.com



--
Brian E. Granger
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu and ellisonbg at gmail.com


