[IPython-dev] IPython.parallel slow push

Fernando Perez fperez.net at gmail.com
Mon Aug 11 18:38:09 EDT 2014


On Mon, Aug 11, 2014 at 6:56 AM, Wes Turner <wes.turner at gmail.com> wrote:

> This [2] seems to suggest that anything that isn't a buffer,
> str/bytes, or numpy array is pickled and copied.
>

That is indeed correct.
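
As a rough illustration of what "pickled and copied" costs, here is a
minimal sketch that uses only the standard pickle module. It does not call
IPython.parallel at all, and the payload (a plain list of integers) is a
made-up example of an object that is not a buffer, str/bytes, or numpy
array:

```python
import pickle

# Hypothetical payload: a plain Python list. It is not a buffer,
# str/bytes, or numpy array, so it falls into the "pickled" category.
payload = list(range(100000))

# Serializing builds a complete byte-string copy of the object...
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# ...and deserializing on the other side builds yet another copy.
restored = pickle.loads(blob)

print("pickled size in bytes:", len(blob))
print("equal contents:", restored == payload)
print("same object:", restored is payload)
```

The serialized blob grows with the data, and the receiver ends up with an
independent copy, so large objects pay both in time and in memory.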


>  Would it be faster to ETL into something like HDF5 (e.g. w/
> Pandas/PyTables) and just synchronize the dataset URI?
>

Absolutely.

IPython.parallel is NOT the right tool to use to move large amounts of data
around between machines. Moving data is an important problem in
parallel/distributed computing, but also a very challenging one that is
beyond our scope and resources.

When using IPython.parallel, you should think of it as a good way to:

- coordinate computation
- move code around
- move *small* data around
- have interactive control in parallel settings

But you should have a non-IPython strategy for moving big chunks of data
around. The right answer to that question will vary from one context to
another. In some cases a simple NFS mount may be enough; elsewhere
something like Hadoop FS or Disco FS may work, or a well-sharded database,
or any other data store that fits the context.
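
The general pattern (put the data somewhere every worker can read it, then
send only its location) can be sketched as follows. This is a standard
library only sketch under stated assumptions: a temporary directory stands
in for the shared storage (NFS, HDF5 file, etc.), JSON stands in for the
real file format, and the helper names write_dataset and load_and_sum are
hypothetical, not part of IPython:

```python
import json
import os
import pickle
import tempfile


def write_dataset(directory, rows):
    """Write rows to a file in a directory that stands in for shared storage."""
    path = os.path.join(directory, "dataset.json")
    with open(path, "w") as fh:
        json.dump(rows, fh)
    return path


def load_and_sum(path):
    """What a worker-side task might do: open the file itself and compute."""
    with open(path) as fh:
        rows = json.load(fh)
    return sum(rows)


# Assumption: this directory is visible to every worker (e.g. an NFS mount).
shared_dir = tempfile.mkdtemp()
rows = list(range(100000))
dataset_path = write_dataset(shared_dir, rows)

# Compare what would have to travel over the wire in each approach.
message_with_data = pickle.dumps(rows)
message_with_path = pickle.dumps(dataset_path)
print("bytes when sending the data:", len(message_with_data))
print("bytes when sending the path:", len(message_with_path))

# The worker reads the data from storage on its own.
total = load_and_sum(dataset_path)
print("sum computed from the file:", total)
```

With a running cluster, only dataset_path would need to be pushed to the
engines; each engine would then load the data from storage on its own.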

But it's simply a problem that we consider orthogonal to what
IPython.parallel can do well.

Hope this helps,

f


-- 
Fernando Perez (@fperez_org; http://fperez.org)
fperez.net-at-gmail: mailing lists only (I ignore this when swamped!)
fernando.perez-at-berkeley: contact me here for any direct mail