[IPython-dev] Thoughts on the notebook format for version control

Sat Nov 5 21:58:12 EDT 2011

Hi folks,

I wanted to start a discussion on the notebook format regarding its
suitability for version control.  I expect the notebook format to be
where I (and hopefully many others too) will keep most of my research
notes/work, and therefore it's important that it's as easy as
possible to version control notebooks and use them smoothly in a
version-controlled workflow.  Unfortunately, the format we have right
now simply doesn't fit that, mainly for two reasons:

1. The cell inputs (code and text) are stored as a single line in the
json format.  Since line-based VC tools see each cell as one line, two
edits anywhere in the same cell will produce a conflict.  Furthermore,
these conflicts are nearly impossible to resolve by hand because you
have to scan very long lines by eye, and can only apply one version or
the other wholesale.

2. The presence of outputs stored inside the file causes two separate problems:
a) The large binary blobs often make the files quite large.
b) Changes in the binary blobs can't really be inspected by hand, but
they easily cause conflicts.

To get a sense of the problem, here's the diff from a pull request
made on a simple (mostly for testing purposes) repo:

https://github.com/fperez/nipy-notebooks/pull/1/files

That diff is more or less useless: note the huge horizontal scroll
bar, and changes in inputs are impossible to understand.

So I think we need to find a solution.  This doesn't have to happen
necessarily right away, since we're trying to put 0.12 out; I think
it's OK if for now our format is mostly treated as a binary blob.  But
we do need to come up with a plan for the medium term.

Here's my proposal, with full credit going to Yarik who suggested the
idea of splitting outputs into a separate file.  There are basically
two changes against what we have now:

1. The notebook would *always* be split into two files, the .ipynb
containing only inputs, and a companion (say .ipynbo) file with all
outputs.  If an output file is not available or is detected to have
problems, such as a mismatch in the number of cells, it is simply
ignored (it can always be recreated by rerunning the notebook).

2. All inputs would be stored in a json list of strings instead of a
single string.
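
To make the two changes concrete, here is a rough sketch of what the
split and merge steps could look like.  This is only an illustration:
the dict layout (a "cells" list whose entries have "cell_type", "input"
and "outputs" keys) and the function names split_notebook and
merge_notebook are made up for this example and are not the current
notebook format or any existing IPython API.

```python
def split_notebook(nb):
    """Split a notebook dict into (inputs, outputs) dicts.

    The layout used here is a hypothetical, simplified one.
    """
    inputs = {"cells": []}
    outputs = {"cells": []}
    for cell in nb["cells"]:
        inputs["cells"].append({
            "cell_type": cell["cell_type"],
            # Change #2: store the source as a list of lines (line
            # endings kept) instead of one long string.
            "input": cell["input"].splitlines(keepends=True),
        })
        # Change #1: outputs go to the companion structure.
        outputs["cells"].append({"outputs": cell.get("outputs", [])})
    return inputs, outputs


def merge_notebook(inputs, outputs=None):
    """Rebuild a notebook dict from its inputs and optional outputs.

    Outputs that are missing, or whose cell count does not match the
    inputs, are ignored; rerunning the notebook would recreate them.
    """
    in_cells = inputs["cells"]
    out_cells = outputs["cells"] if outputs is not None else None
    if out_cells is None or len(out_cells) != len(in_cells):
        out_cells = [{"outputs": []} for _ in in_cells]
    return {"cells": [
        {
            "cell_type": c["cell_type"],
            "input": "".join(c["input"]),
            "outputs": o["outputs"],
        }
        for c, o in zip(in_cells, out_cells)
    ]}
```

The cell-count check is the only consistency test in this sketch; a
real implementation would likely want something stronger before
trusting an outputs file.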

With #1, one would naturally only commit to VC the ipynb file, leaving
the output ones to be always ignored.  People could obviously choose
to commit the output as well, at their own risk. #2 would make it much
easier to get line-by-line diffs of any input (code or text).
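
As a quick illustration of what #2 buys us, the sketch below diffs the
same one-token edit in both storage styles, using only the standard
library.  The "input" key and the added_lines helper are made up for
this example; the point is just how a line-based diff sees each form.

```python
import difflib
import json


def added_lines(old_cell, new_cell):
    """Return the lines a line-based diff reports as added."""
    old = json.dumps(old_cell, indent=1).splitlines()
    new = json.dumps(new_cell, indent=1).splitlines()
    diff = difflib.unified_diff(old, new, lineterm="", n=0)
    # Keep "+" lines, skipping the "+++" file header.
    return [line[1:] for line in diff
            if line.startswith("+") and not line.startswith("+++")]


old_src = "import numpy as np\nx = np.arange(10)\ny = x ** 2\nprint(y.sum())\n"
new_src = old_src.replace("x ** 2", "x ** 3")

# Input stored as one string: the diff flags a single line that
# contains the entire cell.
as_string = added_lines({"input": old_src}, {"input": new_src})

# Input stored as a list of lines: the diff flags only the edited line.
as_list = added_lines(
    {"input": old_src.splitlines(keepends=True)},
    {"input": new_src.splitlines(keepends=True)},
)

print(as_string)
print(as_list)
```

In the first case the one changed line carries the whole cell, which
is what produces the huge horizontal scroll bar; in the second only
the line holding the edited statement shows up.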

I think together, these two changes mostly solve the problems I've
encountered in practice so far.  I'm trying really hard to eat our own
dogfood by using these tools in actual, everyday research work, so
that we see the problems first.  And while I think the notebook is
reaching a point where it's a great working environment (even if we
have a ton of ideas for improvements already and things we know need
fixing), it's clear now to me that we fail pretty badly as a
version-controllable format.

I realize that implementing something like this will add non-trivial
complexity to the format read/write code in a number of places, so if
anyone sees a simpler solution to the problem, we're all ears.  But we
do need to figure out how to make the notebooks first-class citizens
in a VC world; the (effectively) opaque binary blobs they are now just
won't cut it in the long run.

Thoughts, ideas?

Cheers,

f