[Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

Dave dave.hirschfeld at gmail.com
Thu May 12 19:25:55 EDT 2016


Antoine Pitrou <solipsis <at> pitrou.net> writes:

> 
> On Thu, 12 May 2016 06:27:43 +0000 (UTC)
> Sturla Molden <sturla.molden <at> gmail.com> wrote:
> 
> > Allan Haldane <allanhaldane <at> gmail.com> wrote:
> > 
> > > You probably already know this, but I just wanted to note that the
> > > mpi4py module has worked around pickle too. They discuss how they
> > > efficiently transfer numpy arrays in mpi messages here:
> > > http://pythonhosted.org/mpi4py/usrman/overview.html#communicating-python-objects-and-array-data
> > 
> > Unless I am mistaken, they use the PEP 3118 buffer interface to support
> > NumPy as well as a number of other Python objects. However, this protocol
> > makes buffer acquisition an expensive operation.
> 
> Can you define "expensive"?
> 
> > You can see this in Cython
> > if you use typed memory views. Assigning a NumPy array to a typed
> > memoryview (i.e., buffer acquisition) is slow.
> 
> You're assuming this is the cost of "buffer acquisition", while most
> likely it's the cost of creating the memoryview object itself.
> 
> Buffer acquisition itself only calls a single C callback and uses a
> stack-allocated C structure. It shouldn't be "expensive".
> 
> Regards
> 
> Antoine.
> 
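For reference, the buffer acquisition Antoine describes can be exercised from pure Python with the built-in memoryview(), which invokes the producer's PEP 3118 getbuffer slot. This is only a minimal stdlib sketch of the protocol itself, not the Cython typed-memoryview path being benchmarked below:

```python
import array

# Pure-Python sketch of PEP 3118 buffer acquisition: memoryview()
# invokes the producer's bf_getbuffer slot, here on a stdlib array.
# The Cython typed-memoryview path under discussion layers extra
# validation and slicing setup on top of this single C callback.
data = array.array('d', range(8))

with memoryview(data) as mv:   # acquisition on enter, release on exit
    fmt, shape = mv.format, mv.shape
```
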


When I looked at it, using a typed memoryview was between 7 and 50 times slower than using numpy directly:

http://thread.gmane.org/gmane.comp.python.cython.devel/14626


It looks like there was some improvement since then:

https://github.com/numpy/numpy/pull/3779


...and repeating my experiment shows the deficit is down to 3-11 times slower.
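For context, the echo functions timed below are presumably along these lines. This is a hypothetical reconstruction from the function names alone; the actual benchmark code is in the gmane thread linked above:

```cython
import numpy as np
cimport numpy as np

def echo_memview(double[:] x):
    # typed-memoryview argument: buffer acquisition on every call,
    # plus conversion back to an ndarray on return
    return np.asarray(x)

def echo_memview_nocast(double[:] x):
    # buffer acquisition only; return the memoryview itself
    return x

def echo_numpy(np.ndarray x):
    # plain ndarray argument: no PEP 3118 acquisition
    return x
```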


In [5]: x = randn(10000)

In [6]: %timeit echo_memview(x)
The slowest run took 14.98 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.31 µs per loop

In [7]: %timeit echo_memview_nocast(x)
The slowest run took 10.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.58 µs per loop

In [8]: %timeit echo_numpy(x)
The slowest run took 58.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 474 ns per loop



-Dave

