[IPython-dev] Indexing Jupyter Notebooks

Mon Sep 8 16:15:22 EDT 2014

Kyle,

These ideas are great, and I hope to see more work along these lines!

I've been running jupyterhub for one of my classes since last week, and
it is working very well. (If the project "jupyterhub" is new to those reading
this, it is IPython's server-based access to notebooks, similar to Wakari
or Sage Math Cloud, but on my own server. Students log in via the web and
have regular access to the computer. I'm running on Ubuntu 14.10 at
http://jupyter.cs.brynmawr.edu/ if you want to see the bare-bones look
right now. There are bugs and limitations.)

Almost immediately, there were a large number of notebooks to keep track
of... some that have help documentation, others that have examples, and
others in various languages. And the students have created a bunch of their
own (86 notebooks in one week).

What would be cool is an API for creating support functions for a
"collection of notebooks". Whether the collection is just a folder on your
hard drive, a jupyterhub server, or a set of distributed notebooks reached
via their URLs (like you have with nbviewer), it would be wonderful to have
a variety of functions:

* search (including by json structure)
* lists/searches by: number of views, date (e.g., latest), most
up-voted, topic (e.g., Physics), organization, author
* tag cloud
* dynamic sitemap
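
As a rough illustration of the search item above, here is a minimal sketch
for the simplest case, a folder of notebooks on disk. The function names
(iter_cells, cell_text, search_notebooks) are made up for this example and
are not part of IPython or jupyterhub. It assumes each .ipynb file is JSON in
either the nbformat 3 layout (cells under "worksheets", code in "input") or
the nbformat 4 layout (top-level "cells", text in "source").

```python
import json
from pathlib import Path


def iter_cells(notebook):
    """Yield cells from an nbformat 3 (worksheets) or nbformat 4 (cells) dict."""
    if "cells" in notebook:
        yield from notebook["cells"]
    for worksheet in notebook.get("worksheets", []):
        yield from worksheet.get("cells", [])


def cell_text(cell):
    """Return a cell's text; nbformat 3 code cells use 'input', not 'source'."""
    text = cell.get("source", cell.get("input", ""))
    return "".join(text) if isinstance(text, list) else text


def search_notebooks(root, term, cell_type=None):
    """Return (path, cell index) pairs for cells under root containing term."""
    hits = []
    for path in sorted(Path(root).rglob("*.ipynb")):
        try:
            notebook = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable or malformed files
        if not isinstance(notebook, dict):
            continue  # valid JSON, but not a notebook
        for index, cell in enumerate(iter_cells(notebook)):
            # Optional structural filter, e.g. only "code" or "markdown" cells.
            if cell_type and cell.get("cell_type") != cell_type:
                continue
            # Case-insensitive plain-text match on the cell body.
            if term.lower() in cell_text(cell).lower():
                hits.append((str(path), index))
    return hits
```

Something like search_notebooks("notebooks/", "lorenz", cell_type="markdown")
(the folder name is just a placeholder) would list matching markdown cells.
A real index, such as the ElasticSearch setup Kyle describes, would replace
the linear scan, but the same per-cell view of a notebook could feed it.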

Basically, anything that Drupal or WordPress has thought of would be
useful for a collection of notebooks. Tweak jupyterhub a bit and you have a
notebook-based blog, wiki, class website, or online journal.

So, I have little knowledge of what technology would be useful for
implementing indexing, but I'm hoping that whatever you create can be used
in a way that supports these kinds of uses. It would be great to be able to
drop a search module into jupyterhub, maybe configure it the way that
Django or Tornado do with plugins, and have it appear as a panel on the
main webpage.

As for privacy for nbviewer: yes, you should probably respect robots.txt.
GitHub and the like would seem to be fair game (unless, perhaps, they have
some "noindex" flag in their metadata).
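
To make that concrete, here is a small sketch of how an indexer might check
both things before adding a notebook. It uses Python's standard
urllib.robotparser for robots.txt. The crawler name "nbindexer" and the
"noindex" metadata key are assumptions for illustration only; neither is an
existing convention in nbviewer or the notebook format.

```python
from urllib import robotparser
from urllib.parse import urlsplit

USER_AGENT = "nbindexer"  # hypothetical crawler name, not an existing project


def robots_url(url):
    """Build the robots.txt URL for the site hosting `url`."""
    parts = urlsplit(url)
    return "{}://{}/robots.txt".format(parts.scheme, parts.netloc)


def allowed_by_robots(url, robots_lines, user_agent=USER_AGENT):
    """Check `url` against robots.txt content given as a list of lines."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def opted_out(notebook):
    """True if the notebook carries the (hypothetical) metadata 'noindex' flag."""
    return bool(notebook.get("metadata", {}).get("noindex", False))
```

Against a live site, the parser could instead be pointed at
robots_url(url) with set_url() and read(), which fetches and parses the
file in one step.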

Thanks!

-Doug

On Mon, Sep 8, 2014 at 3:19 PM, Kyle Kelley <rgbkrk at gmail.com> wrote:

> Hey IPythonistas, Jovians, and friends,
>
> Before our IPython development meeting this past week (on the plane to SFO
> actually), we started experimenting with indexing Jupyter notebooks using
> ElasticSearch.
>
> The notebooks are JSON which ElasticSearch can handle directly:
>
> # pip install elasticsearch requests
> import elasticsearch
> import requests
>
> es = elasticsearch.Elasticsearch([{'host': "127.0.0.1", 'port': 9200}])
>
> resp = requests.get("http://bit.ly/lorenzsystem")
> notebook_contents = resp.json()
>
> resp = es.index(index='notebooks',
>                 doc_type='ipynb',
>                 body=notebook_contents)
>
>
> The current setup in the notebook viewer can be viewed (or hacked on) via
> https://github.com/ipython/nbviewer/blob/568ae74c9b8a74f7b24cac271b6b1fbdd1c42643/nbviewer/index.py#L34-L48
> .
>
> Announcing this at the development meetings got a lot of discussion going.
>
> *Potential Utility of Indexing Notebooks*
>
>    - Searching notebooks with plain old raw text search
>    - Searching notebooks using the actual notebook structure (by
>    language, cell types, etc.)
>    - Streaming feed of notebooks being "created"
>    - Galleries of plots
>    - Using code cells to build models for code completion
>    - Trending notebooks (* don't actually need a full notebook index for
>    this)
>    - Corpus of structured scholarly works (for some value of scholarly)
>
> Though Notebook Viewer only handles publicly accessible notebooks,
> including secret gists, this does bring up some privacy concerns.
>
> *Privacy Concerns*
>
> What notebooks can we expose through a search interface or API?
>
> Do we need to obey the robots.txt for each site?
>
> What about notebooks on GitHub? Do we use some robots.txt scheme? They're
> searchable on GitHub already.
>
> There was even more brought up during the dev meetings, but we'd like to
> hear from everyone within the IPython/Jupyter community.
>
> *Current Plans*
>
> Currently we're indexing notebooks to see what running elasticsearch nodes
> is like and determine our infrastructure needs. We're not exposing the data
> or the elasticsearch node for the time being. In fact, the current node
> that is indexing will be torn down shortly. It was a good experiment, but
> needs tweaking to be done in practice.
>
> We've at least determined that we need separate storage of
>
>    - Embedded images
>    - Static widgets
>
> We're currently thinking these should be referenced by UUID and either go
> to an object store (for us, Rackspace CloudFiles) or a database.
>
> Dumping an entire notebook as is into the notebook index actually doesn't
> lend itself well to searching the notebook natively (we would need to index
> specific fields, not make nested queries). I really need to understand
> ElasticSearch more fully as well, as I only started with it last week.
>
> Would love to hear your thoughts on all of this.
>
> Cheers,
>
> Kyle Kelley
>
>
> P.S. See http://en.wikipedia.org/wiki/Jovian_(fiction) for why I said
> Jovians up top. ;)
>
> --
> Kyle Kelley (@rgbkrk <https://twitter.com/rgbkrk>; http://lambdaops.com)
>
> _______________________________________________
> IPython-dev mailing list
> IPython-dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-dev
>
>