[Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

Feng Yu rainwoodman at gmail.com
Wed May 11 18:38:15 EDT 2016


Hi,

I've been thinking about and exploring this for some time. If we are to
start some effort I'd like to help. Here are my comments, mostly in
response to Sturla's comments.

1. If we are talking about shared memory and copy-on-write
inheritance, then we are using 'fork'. If we are free to use fork,
then a large chunk of the concerns about the Python standard library's
multiprocessing is no longer relevant, especially the limitation that
worker functions must be defined at module level (so that they can be
pickled), which tends to impose special requirements on the software
design.

2. Pickling of an inherited shared memory array can be done minimally
by just pickling the __array_interface__ and the pointer address. This
works because the child process and the parent share the same address
space layout, guaranteed by the fork call. A rough sketch follows.
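
As a rough sketch of the idea (view_from_address and the ctypes
reconstruction below are my own illustration, not an existing API; the
'fork' start method is assumed, as on Linux by default):

import ctypes
import multiprocessing as mp
import numpy as np

def view_from_address(addr, shape, dtype):
    # Rebuild a NumPy view onto memory inherited through fork. The raw
    # address is only meaningful because parent and child share the
    # same address space layout.
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    buf = (ctypes.c_char * nbytes).from_address(addr)
    return np.frombuffer(buf, dtype=dtype).reshape(shape)

if __name__ == '__main__':
    a = np.arange(10**6, dtype='f8')
    addr = a.__array_interface__['data'][0]

    def worker():
        # With fork the closure is simply inherited; nothing is pickled.
        view = view_from_address(addr, a.shape, 'f8')
        print(view[:5])   # reads the parent's pages, no copy is made

    p = mp.Process(target=worker)
    p.start()
    p.join()

Note that because the inherited pages are copy-on-write, a write in the
child stays private to the child; writes meant for the parent need real
shared memory (see points 3 and 6).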

3. The RawArray and RawValue implementations in std multiprocessing
have their own memory allocator for managing small variables. That is
huge overkill (in terms of implementation) if we only care about very
large memory chunks.
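
For very large chunks a single anonymous shared mmap wrapped by NumPy
is already enough; a minimal sketch (the size is arbitrary):

import mmap
import numpy as np

n = 10**7
buf = mmap.mmap(-1, n * 8)            # anonymous mapping, MAP_SHARED on Unix
big = np.frombuffer(buf, dtype='f8')  # one large float64 chunk, no allocator
big[:] = 0.0                          # writes are visible to forked children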

4. There is a hidden synchronization cost on multi-cpu (NUMA?) systems.
One choice is to defer the responsibility for avoiding races to the
developer. Simple constructs for working on disjoint slices of an array
in parallel can cover a huge fraction of use cases and fully avoid this
issue (see the sketch after point 6).

5. Whether to delegate parallelism to the underlying low level
implementation, or to implement the parallelism in Python while keeping
the underlying low level implementation sequential, probably depends on
the problem. Given the current state of parallelism support in Python it
may be convenient to delegate, but will that always be the case?

For example, after the MPI FFTW binding was stuck for a long time,
someone wrote a parallel Python FFT package
(https://github.com/spectralDNS/mpiFFT4py) that uses FFTW for the
sequential transforms, implements all of the parallel semantics in
Python with mpi4py, and uses a more efficient domain decomposition.

6. If we are to define a set of operations, I would recommend taking a
look at OpenMP as a reference -- it has been out there for decades and
is widely used. An equivalent of the 'omp parallel for' construct in
Python would be a very good starting point and immediately useful.
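
For instance, a rough sketch of such a construct (parallel_for and body
are hypothetical names, not an existing API; the 'fork' start method is
assumed, and the output buffer lives in shared memory so that the
workers' writes are visible to the parent):

import os
import multiprocessing as mp
import numpy as np

def parallel_for(body, n, nworkers=None):
    # Split range(n) into contiguous chunks and run body(lo, hi) in
    # forked workers; each worker owns a disjoint slice, so no
    # synchronization is needed (cf. point 4).
    nworkers = nworkers or os.cpu_count()
    edges = np.linspace(0, n, nworkers + 1).astype(int)
    procs = [mp.Process(target=body, args=(int(lo), int(hi)))
             for lo, hi in zip(edges[:-1], edges[1:])]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == '__main__':
    n = 10**6
    x = np.arange(n, dtype='f8')                    # inherited copy-on-write
    out = np.frombuffer(mp.RawArray('d', n), 'f8')  # shared, writable by workers

    def body(lo, hi):
        out[lo:hi] = np.sqrt(x[lo:hi])

    parallel_for(body, n)
    assert np.allclose(out, np.sqrt(x))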

- Yu

On Wed, May 11, 2016 at 11:22 AM, Benjamin Root <ben.v.root at gmail.com> wrote:
> Oftentimes, if one needs to share numpy arrays for multiprocessing, I would
> imagine that it is because the array is huge, right? So, the pickling
> approach would copy that array for each process, which defeats the purpose,
> right?
>
> Ben Root
>
> On Wed, May 11, 2016 at 2:01 PM, Allan Haldane <allanhaldane at gmail.com>
> wrote:
>>
>> On 05/11/2016 04:29 AM, Sturla Molden wrote:
>> > 4. The reason IPC appears expensive with NumPy is because
>> > multiprocessing
>> > pickles the arrays. It is pickle that is slow, not the IPC. Some would
>> > say
>> > that the pickle overhead is an integral part of the IPC overhead, but I
>> > will argue that it is not. The slowness of pickle is a separate problem
>> > altogether.
>>
>> That's interesting. I've also used multiprocessing with numpy and didn't
>> realize that. Is this true in python3 too?
>>
>> In python2 it appears that multiprocessing uses pickle protocol 0 which
>> must cause a big slowdown (a factor of 100) relative to protocol 2, and
>> uses pickle instead of cPickle.
>>
>> a = np.arange(40*40)
>>
>> %timeit pickle.dumps(a)
>> 1000 loops, best of 3: 1.63 ms per loop
>>
>> %timeit cPickle.dumps(a)
>> 1000 loops, best of 3: 1.56 ms per loop
>>
>> %timeit cPickle.dumps(a, protocol=2)
>> 100000 loops, best of 3: 18.9 µs per loop
>>
>> Python 3 uses protocol 3 by default:
>>
>> %timeit pickle.dumps(a)
>> 10000 loops, best of 3: 20 µs per loop
>>
>>
>> > 5. Shared memory does not improve on the pickle overhead because NumPy
>> > arrays backed by shared memory must also be pickled. Multiprocessing can
>> > bypass pickling the RawArray object, but the rest of the NumPy array is
>> > pickled. Using shared memory arrays has no speed advantage over normal
>> > NumPy arrays when we use multiprocessing.
>> >
>> > 6. It is much easier to write concurrent code that uses queues for
>> > message passing than anything else. That is why using a Queue object has
>> > been the popular Pythonic approach to both multithreading and
>> > multiprocessing. I would like this to continue.
>> >
>> > I am therefore focusing my effort on the multiprocessing.Queue object.
>> > If
>> > you understand the six points I listed you will see where this is going:
>> > What we really need is a specialized queue that has knowledge about
>> > NumPy
>> > arrays and can bypass pickle. I am therefore focusing my efforts on
>> > creating a NumPy aware queue object.
>> >
>> > We are not doing the users a favor by encouraging the use of shared
>> > memory
>> > arrays. They help with nothing.
>> >
>> >
>> > Sturla Molden
>>
>>
