[IPython-dev] IPython.parallel slow push

Wes Turner wes.turner at gmail.com
Mon Aug 11 09:06:46 EDT 2014


If I understand your use case correctly, each compute node needs its own
copy of the same ~2 GB of data?
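
Something like this untested sketch is what I'm picturing (the array size
and variable names are just placeholders):

    import numpy as np
    from IPython.parallel import Client

    rc = Client()   # connect to the running cluster
    dv = rc[:]      # DirectView over all engines

    # Stand-in for the ~2 GB working set.
    data = np.random.random(2 * 1024**3 // 8)   # ~2 GB of float64

    # Every engine gets its own full copy, so total transfer time grows
    # roughly linearly with the number of engines.
    dv.push({'data': data}, block=True)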

For reference, how does the transfer rate compare to rsync-ing the
same data over SSH?
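
For a rough comparison, something along these lines could put MB/s numbers
on both paths (untested sketch; `node01` and the file name are made up):

    import subprocess
    import time

    import numpy as np
    from IPython.parallel import Client

    rc = Client()
    dv = rc[:]
    data = np.random.random(2 * 1024**3 // 8)   # ~2 GB of float64

    # IPython.parallel push to every engine (blocking, so the timing is honest).
    start = time.time()
    dv.push({'data': data}, block=True)
    push_rate = data.nbytes * len(rc.ids) / 1e6 / (time.time() - start)

    # rsync the same bytes to a single node over SSH.
    np.save('data.npy', data)
    start = time.time()
    subprocess.check_call(['rsync', '-a', 'data.npy', 'node01:/tmp/'])
    rsync_rate = data.nbytes / 1e6 / (time.time() - start)

    print('push: %.0f MB/s aggregate, rsync: %.0f MB/s' % (push_rate, rsync_rate))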

These may be helpful (a sketch of the load-on-each-engine alternative follows the links):

* https://en.wikipedia.org/wiki/Locality_of_reference
* https://en.wikipedia.org/wiki/Clustered_file_system#Examples_2
* https://en.wikipedia.org/wiki/MapReduce#Dataflow
* https://spark.apache.org/
* http://continuum.io/blog/blaze :

> Unlike NumPy, Blaze is designed to handle out-of-core computations on large datasets that exceed the system memory capacity, as well as on distributed and streaming data. Blaze is able to operate on datasets transparently as if they behaved like in-memory NumPy arrays.
>
> We aim to allow analysts and scientists to productively write robust and efficient code, without getting bogged down in the details of how to distribute computation, or worse, how to transport and convert data between databases, formats, proprietary data warehouses, and other silos.
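
Along the lines of the locality-of-reference and clustered-filesystem links
above, one way to avoid shipping the 2 GB through the client at all is to
have each engine load the data itself from shared (or pre-staged local)
storage. A rough sketch, with a made-up path:

    from IPython.parallel import Client

    rc = Client()
    dv = rc[:]

    # Each engine reads its own copy directly from shared (or pre-staged
    # local) storage, so the client only ships a line of code, not 2 GB.
    dv.execute("import numpy as np; data = np.load('/shared/data.npy')",
               block=True)

    # 'data' now exists in every engine's namespace and can be used by
    # later apply()/map() calls over the parameter space.

If several engines share a host, staging one copy per host and opening it
with np.load(path, mmap_mode='r') can also keep the copies from multiplying
in RAM.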
--
Wes Turner


On Mon, Aug 11, 2014 at 7:52 AM, Moritz Beber <moritz.beber at gmail.com> wrote:
> Dear all,
>
> I often find myself in the situation where I use IPython.parallel to push a
> bunch of data (around 2 GB) to the kernels and then do calculations over a
> large parameter space on that data. This is convenient and simple enough to
> do, but copying the data takes a long time (it scales with the number of
> kernels).
>
> Is this something to be avoided altogether, or are there ways to speed it up?
>
> I'd welcome any pointers.
>
> Best,
> Moritz
>