[IPython-dev] Indexing Jupyter Notebooks

Mon Sep 8 15:19:31 EDT 2014

Hey IPythonistas, Jovians, and friends,

Before our IPython development meeting this past week (on the plane to SFO
actually), we started experimenting with indexing Jupyter notebooks using
ElasticSearch.

Notebooks are JSON, which ElasticSearch can index directly:

# pip install elasticsearch requests
import elasticsearch
import requests

# Connect to a local ElasticSearch node
es = elasticsearch.Elasticsearch([{'host': "127.0.0.1", 'port': 9200}])

# Fetch a notebook and parse its JSON
resp = requests.get("http://bit.ly/lorenzsystem")
notebook_contents = resp.json()

# Index the whole notebook as a single document
resp = es.index(index='notebooks',
                doc_type='ipynb',
                body=notebook_contents)

The current setup in the notebook viewer can be viewed (or hacked on) at
https://github.com/ipython/nbviewer/blob/568ae74c9b8a74f7b24cac271b6b1fbdd1c42643/nbviewer/index.py#L34-L48

Announcing this at the development meetings got a lot of discussion going.

*Potential Utility of Indexing Notebooks*

   - Searching notebooks with plain old raw text search
   - Searching notebooks using the actual notebook structure (by language,
   cell types, etc.)
   - Streaming feed of notebooks being "created"
   - Galleries of plots
   - Using code cells to build models for code completion
   - Trending notebooks (we don't actually need a full notebook index for
   this)
   - Corpus of structured scholarly works (for some value of scholarly)

Though Notebook Viewer only handles publicly accessible notebooks,
including secret gists, this does bring up some privacy concerns.

*Privacy Concerns*

What notebooks can we expose through a search interface or API?

Do we need to obey the robots.txt for each site?

What about notebooks on GitHub? Do we use some robots.txt scheme? They're
searchable on GitHub already.
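
If we did decide to honor robots.txt, the check itself is cheap. Here is a
rough sketch using Python's standard library; the function name and the
"nbviewer-indexer" user agent are made up for illustration, and nothing like
this exists in nbviewer today:

```python
from urllib import robotparser


def allowed_by_robots(url, robots_txt, user_agent="nbviewer-indexer"):
    """Return True if `robots_txt` (the file's text) permits fetching `url`.

    `user_agent` is a hypothetical name for the indexer, not a real one.
    """
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

The robots.txt text would be fetched once per site and cached; whether a
disallowed notebook should also be skipped for rendering, or only for
indexing, is still the open policy question.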

There was even more brought up during the dev meetings, but we'd like to
hear from everyone within the IPython/Jupyter community.

*Current Plans*

Currently we're indexing notebooks to see what running ElasticSearch nodes
is like and to determine our infrastructure needs. We're not exposing the
data or the ElasticSearch node for the time being. In fact, the node that is
currently indexing will be torn down shortly. It was a good experiment, but
the setup needs tweaking before it is usable in practice.

We've at least determined that we need separate storage for:

   - Embedded images
   - Static widgets

We're currently thinking these should be referenced by UUID and either go
to an object store (for us, Rackspace CloudFiles) or a database.
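
To make the UUID idea concrete, here is a minimal sketch. It assumes a
notebook layout where each output keeps its payloads under a "data" dict
keyed by MIME type (cells -> outputs -> data -> "image/png"); other notebook
format versions lay this out differently. The "image/png-ref" field and the
dict-like `store` are placeholders for illustration, not anything we have
settled on:

```python
import copy
import uuid


def extract_images(notebook, store):
    """Return a copy of `notebook` with embedded PNGs replaced by UUID keys.

    `store` is any dict-like object standing in for the object store or
    database; the original notebook dict is left untouched.
    """
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        for output in cell.get("outputs", []):
            data = output.get("data", {})
            if "image/png" in data:
                key = str(uuid.uuid4())
                # The base64 payload moves to the store...
                store[key] = data.pop("image/png")
                # ...and only a reference stays in the indexed document.
                # "image/png-ref" is a hypothetical field name.
                data["image/png-ref"] = key
    return stripped
```

The stripped copy is what would go to the index, which keeps large base64
blobs out of ElasticSearch while leaving a pointer back to each image.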

Dumping an entire notebook as-is into the notebook index doesn't actually
lend itself well to searching on the notebook's structure: we would need to
index specific fields rather than rely on nested queries. I also need to
understand ElasticSearch more fully, as I only started with it last week.
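
One way to get specific fields, sketched here under assumptions rather than
as what we will actually do, is to flatten a notebook into one small document
per cell before indexing. This assumes a top-level "cells" list and a
"kernelspec" entry in the metadata; the output field names (notebook_id,
position, and so on) are invented for the example:

```python
def cell_documents(notebook, notebook_id):
    """Yield one flat dict per cell, each suitable as its own index document."""
    metadata = notebook.get("metadata", {})
    language = metadata.get("kernelspec", {}).get("language")
    for position, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        if isinstance(source, list):
            # Cell source may be stored as a list of lines; join it back up.
            source = "".join(source)
        yield {
            "notebook_id": notebook_id,   # hypothetical identifier
            "position": position,         # order of the cell in the notebook
            "cell_type": cell.get("cell_type"),
            "language": language,
            "source": source,
        }
```

Each yielded dict could then be passed as the body of an es.index call like
the one at the top of this message, so that queries on cell type or language
become plain field matches.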

Would love to hear your thoughts on all of this.

Cheers,

Kyle Kelley

P.S. See http://en.wikipedia.org/wiki/Jovian_(fiction) for why I said
Jovians up top. ;)

-- 
Kyle Kelley (@rgbkrk <https://twitter.com/rgbkrk>; http://lambdaops.com)