[IPython-dev] IPython.parallel slow push

Wes Turner wes.turner at gmail.com
Mon Aug 11 09:56:16 EDT 2014


On Mon, Aug 11, 2014 at 8:35 AM, Moritz Beber <moritz.beber at gmail.com> wrote:
> Hi Wes,
>
> Thank you for the quick response.
>
>
> On Mon, Aug 11, 2014 at 3:06 PM, Wes Turner <wes.turner at gmail.com> wrote:
>>
>> If I understand your use case correctly, each computation node needs a
>> copy of all 2GB of the same data?
>>
>> For reference, how does the transfer rate compare to rsync-ing the
>> same data over SSH?
>
>
> I should have been more clear, it's actually a local cluster, i.e., same
> host just multiple local kernels.

Got it. So the network is probably not the primary bottleneck.

> So I have one process that loads the data,
> does some pre-processing and then creates a dictionary that it pushes to the
> namespace of the other kernels.

So that's a 2GB dictionary being serialized out of one process's
RAM and copied into each of the 64 kernel namespaces.

With Celery, similar workflows can be modeled with a "Chord". [1]
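
A minimal sketch of a chord (the broker URL and task bodies are
made up here, just to show the shape of the workflow):

    from celery import Celery, chord

    # hypothetical broker/backend; any supported broker works
    app = Celery('tasks',
                 broker='redis://localhost:6379/0',
                 backend='redis://localhost:6379/0')

    @app.task
    def process_chunk(i):
        # each worker handles its own slice of the work
        return i ** 2

    @app.task
    def combine(results):
        # the callback fires once every parallel task has finished
        return sum(results)

    # fan out 64 tasks, then reduce their results:
    async_result = chord(process_chunk.s(i) for i in range(64))(combine.s())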

> I understand that the data needs to go
> through ZMQ but since it's on the same machine I expected it to be faster
> (with 64 kernels it takes about 15 min).

TBH, I'm not too familiar with IPython.parallel.

This [2] seems to suggest that anything that isn't a buffer,
str/bytes, or numpy array is pickled and copied.
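
If I'm reading that right, the practical difference looks
something like this (untested sketch against the IPython 2.x API):

    from IPython.parallel import Client
    import numpy as np

    rc = Client()
    dview = rc[:]  # DirectView on all engines

    # numpy arrays go over the wire as raw buffers, no pickling:
    dview.push({'arr': np.zeros((1000, 1000))})

    # a plain dict is pickled in full and a separate copy is sent to
    # every engine; with 64 engines and 2GB, that is a lot of
    # serialize/copy work:
    dview.push({'big_dict': {'key': 'value'}})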

Would it be faster to ETL the data into something like HDF5 (e.g.
w/ Pandas/PyTables) and push just the dataset path/URI to each
kernel, so each one reads from local disk instead?
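
Roughly (untested; the path and key are placeholders, and dview is
the DirectView from the sketch above):

    import numpy as np
    import pandas as pd

    # the pre-processing kernel writes the data to disk once:
    df = pd.DataFrame(np.random.rand(1000, 10))
    df.to_hdf('/tmp/shared.h5', 'dataset', mode='w')

    # push only the path; each kernel reads it locally, and the OS
    # page cache makes repeated same-host reads cheap:
    dview.push({'path': '/tmp/shared.h5'})
    dview.execute("import pandas as pd; data = pd.read_hdf(path, 'dataset')")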

[1] http://celery.readthedocs.org/en/latest/userguide/canvas.html#chords
[2] http://ipython.org/ipython-doc/dev/parallel/parallel_details.html#what-is-sendable

> If desired I can come up with a
> minimal notebook.
>
> Cheers,
> Moritz


--
Wes Turner


