[IPython-dev] pyspark and IPython

Nitin Borwankar nborwankar at gmail.com
Thu Aug 29 18:15:29 EDT 2013


Hi Brian,

Yes, OK, I wasn't clear either.  Meta thing - IPython and NB have spoilt me
- I want IPy as a cmd line for everything, and to be able to launch all
cmdline programs from IPy and IPyNB.  So that's the meta goal.  With every
new cmdline I encounter, I check whether !ls works; if not, it is no longer
a good enough cmdline for me, and I look for a cell magic for that
cmdline :-) !

In the context of Spark/Shark and family, they are early efforts and I want
to be able to play with the many moving parts in there fast and furiously,
without being limited by the earliness of their interface.  So if I can
plug these into IPy then all the better.

I am not sure if there's any value in integrating the two parallel
computing models as they seem to serve different audiences.  The IPy
parallel computing model seems closer to what the scientific community
needs and at first sight the Spark/Shark model seems to serve the more
business oriented data demographic.

Where there is an intersection, IMHO, is in the ML area - we will learn
about that tomorrow at the conference.  In any case, in the spirit of "IPy
over everything", I hope I can do some integration.

Fernando is here too, and we chatted at lunch, but about pretty much
everything except the AMP stuff.  I think it makes more sense to wait
till the end of the day tomorrow to report on the content.

Nitin

P.S. At an IETF meeting decades ago Vint Cerf wore a t-shirt that said "IP
over everything", so "IPy over everything" is my homage to that t-shirt.




------------------------------------------------------------------
Nitin Borwankar
nborwankar at gmail.com


On Thu, Aug 29, 2013 at 2:58 PM, Brian Granger <ellisonbg at gmail.com> wrote:

> Sorry I wasn't clear in my question.  I am very aware of how amazing
> Spark and Shark are.  I do think you are right that they are looking
> very promising right now.  What I don't see is what IPython can offer
> in working with them.  Given their architecture, I don't see how for
> example you could run spark jobs from the IPython Notebook
> interactively.  Is that the type of thing you are thinking about?  Or
> are you thinking more about direct integration of spark and
> IPython.parallel?  I am more wondering what the benefit of
> IPython+Spark integration would be.  I know that Fernando and Min have
> talked with some of the AMP lab people and I would love to see what
> can be done.  It would probably be best to sit down and talk further
> with the spark/shark devs at some point.  But if you can learn more
> about their architecture and investigate the possibilities and report
> back, that would be fantastic.
>
> On Thu, Aug 29, 2013 at 2:41 PM, Nitin Borwankar <nborwankar at gmail.com> wrote:
> > Hi Brian,
> >
> > The advantage IMHO is that pyspark and the larger UCB AMP effort are a huge
> > open source effort for distributed parallel computing that improves upon the
> > Hadoop model. Spark, the underlying layer, plus Shark, the Hive-compatible
> > query language, adds performance gains of 10x - 100x.  The effort has 20+
> > companies contributing code, including Yahoo, and 70+ contributors. AMP has
> > a $10M grant from NSF.  So
> > a) it's not going away soon
> > b) it may be hard to compete with it without that level of resources
> > c) they do have a Python shell (have not used it yet) and they appear
> > committed to have Python as a first class language in their effort.
> > d) let's see if we can find ways to integrate with it.
> >
> > I think integration at the level of the interactive interface might make
> > sense.
> >
> > Just my 2c but I think this effort may leapfrog pure Hadoop over the next
> > 2-3 years.
> >
> >
> > Nitin.
> >
> >
> >
> >
> > ------------------------------------------------------------------
> > Nitin Borwankar
> > nborwankar at gmail.com
> >
> >
> > On Thu, Aug 29, 2013 at 1:35 PM, Brian Granger <ellisonbg at gmail.com> wrote:
> >>
> >> From a quick glance, it looks like both pyspark and IPython use
> >> similar parallel computing models in terms of the process model.  You
> >> might think that would help them to integrate, but in this case I
> >> think it will get in the way of integration.  Without learning more
> >> about the low-level details of their architecture it is really
> >> difficult to know if it is possible or not.  But I think the bigger
> >> question is what would the motivation for integration be?  Both
> >> IPython and spark provide self-contained parallel computing
> >> capabilities - what use cases are there for using both at the same
> >> time?  I think the biggest potential show stopper is that pyspark is
> >> not designed in any way to be interactive as far as I can tell.
> >> Pyspark jobs basically run in batch mode, which is going to make it
> >> really tough to fit into IPython's interactive model.  Worth looking
> >> more into though..
> >>
> >> Cheers,
> >>
> >> Brian
> >>
> >> On Thu, Aug 29, 2013 at 11:28 AM, Nitin Borwankar <nborwankar at gmail.com> wrote:
> >> > I'm at AmpCamp3 at UCB and see that there would be huge benefits to
> >> > integrating pyspark with IPython and IPyNB.
> >> >
> >> > Questions:
> >> >
> >> > a) has this been attempted/done? If so, pointers please.
> >> >
> >> > b) does this overlap the IPyNB parallel computing effort in
> >> > conflicting/competing ways?
> >> >
> >> > c) if this has not been done yet - does anyone have a sense of how much
> >> > effort this might be? (I've done a small hack integrating postgres psql
> >> > into ipynb so I'm not terrified by that level of deep digging, but are
> >> > there any show stopper gotchas?)
> >> >
> >> > Thanks much,
> >> >
> >> > Nitin
> >> > ------------------------------------------------------------------
> >> > Nitin Borwankar
> >> > nborwankar at gmail.com
> >> >
> >> > _______________________________________________
> >> > IPython-dev mailing list
> >> > IPython-dev at scipy.org
> >> > http://mail.scipy.org/mailman/listinfo/ipython-dev
> >> >
> >>
> >>
> >>
> >> --
> >> Brian E. Granger
> >> Cal Poly State University, San Luis Obispo
> >> bgranger at calpoly.edu and ellisonbg at gmail.com
> >
> >
> >
> >
>
>
>
> --
> Brian E. Granger
> Cal Poly State University, San Luis Obispo
> bgranger at calpoly.edu and ellisonbg at gmail.com
>