[Numpy-discussion] Help in speeding up accumulation in a matrix

Nicolas SCHEFFER scheffer.nicolas at gmail.com
Sat Jan 29 17:03:43 EST 2011


Thanks for the prompt reply!
I quickly tried that, and it actually helps compared to the fully
vectorized version.
Depending on the dimensions, the chunk size has to be tuned (typically
100 or so).
But I don't get any improvement with respect to the simple for loop (I
can almost match its time, though).
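For reference, here is roughly what the chunked variant I timed looks like (a sketch; the helper name and the chunk size of 100 are just illustrative):

```python
import numpy as np

def scale_and_add_chunked(R, w, Ms, chunk=100):
    """Accumulate sum_i w[i] * Ms[i] into R, one chunk of matrices at a time.

    Chunking keeps the broadcast temporary at (chunk, mdim, mdim) instead of
    materializing the full (nb_add, mdim, mdim) product at once.
    """
    nb_add = Ms.shape[0]
    for start in range(0, nb_add, chunk):
        stop = min(start + chunk, nb_add)
        # Broadcast the weights over the chunk of matrices, then reduce it.
        R += (w[start:stop, np.newaxis, np.newaxis] * Ms[start:stop]).sum(axis=0)
    return R
```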

My guess is that the overhead of the dot product is still too large
compared to element-by-element multiplication of matrices.
Can't I somehow leverage the fact that the matrices are symmetric?
I'm sure there must be some slicing wizardry for such a problem ;)
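What I have in mind is something like accumulating only the upper triangle and mirroring it at the end (a sketch of my own, assuming R and every Ms[i] are symmetric; it roughly halves the multiply work, though the fancy indexing may eat the gain):

```python
import numpy as np

def scale_and_add_sym(R, w, Ms):
    """Accumulate sum_i w[i] * Ms[i] into R, touching only the upper triangle.

    Assumes R and every Ms[i] are symmetric, so the lower triangle can be
    reconstructed by mirroring at the end.
    """
    nb_add, mdim, _ = Ms.shape
    iu = np.triu_indices(mdim)        # upper triangle, diagonal included
    upper = np.zeros(len(iu[0]))
    for i in range(nb_add):
        upper += w[i] * Ms[i][iu]     # accumulate upper-triangular entries only
    R[iu] += upper
    il = np.tril_indices(mdim, k=-1)  # strictly lower triangle
    R[il] = R.T[il]                   # mirror the upper triangle down
    return R
```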

Can the += on the result be optimized by np.add.accumulate or similar?
Or is it done in C anyway, without coming back to the Python VM?
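One related thing worth noting: the += itself does run in C, but `w[i]*Ms[i]` allocates a fresh temporary on every iteration. A sketch of what I mean by avoiding that, using the ufuncs' `out=` argument with one preallocated scratch buffer:

```python
import numpy as np

def scale_and_add_inplace(R, w, Ms):
    """Same accumulation as the plain loop, but reuse a single scratch
    buffer instead of allocating a temporary for w[i] * Ms[i] each time."""
    tmp = np.empty_like(Ms[0])
    for i in range(len(w)):
        np.multiply(Ms[i], w[i], out=tmp)  # tmp = w[i] * Ms[i], no new allocation
        np.add(R, tmp, out=R)              # R += tmp, fully in place
    return R
```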

I also looked into making the arrays C-contiguous in memory, but had
not much luck there either.
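Regarding the dot-product version quoted below: the same contraction can also be written without the explicit transpose via np.tensordot, which I'd expect to dispatch to BLAS (a sketch with toy sizes; the full-size temporary problem remains):

```python
import numpy as np

rng = np.random.RandomState(3)
w = rng.rand(20)            # toy sizes, just to show the call
Ms = rng.rand(20, 5, 5)

# Contract w's single axis against the first axis of Ms; result is (5, 5),
# equivalent to sum_i w[i] * Ms[i].
R = np.tensordot(w, Ms, axes=([0], [0]))
```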

-n


>> Hi all,
>>
>> First email to the list for me, I just want to say how grateful I am
>> to have python+numpy+ipython etc... for my day to day needs. Great
>> combination of software.
>>
>> Anyway, I've been having a bottleneck in one of my algorithms that
>> has been bugging me for quite a while.
>> The objective is to speed this part up. I've done tons of
>> optimization and parallel processing around that piece of code to get
>> a decent run time.
>>
>> The problem is simple: accumulate into a matrix a weighted
>> sum of other matrices. Let's call this function scale_and_add:
>>
>> def scale_and_add_re(R, w, Ms):
>>     (nb_add, mdim, j) = np.shape(Ms)
>>     for i in range(nb_add):
>>         R += w[i] * Ms[i]
>>     return R
>>
>> This 'for' loop bugs me since I know it will slow things down.
>>
>> But the dimensions of these things are so large that any attempt to
>> vectorize this is slower and takes too much memory.
>> I typically work with 1000 weights and matrices, the matrices having
>> dimensions of several hundred.
>>
>> My current config is:
>> In [2]: %timeit scale_and_add_re(R,w,Ms)
>> 1 loops, best of 3: 392 ms per loop
>>
>> In [3]: w.shape
>> Out[3]: (2000,)
>>
>> In [4]: Ms.shape
>> Out[4]: (2000, 200, 200)
>> I'd like to be able to double these dimensions.
>>
>> For instance, I can express this as a dot product:
>> %timeit dot(Ms.T, w)
>> 1 loops, best of 3: 1.77 s per loop
>> But this is (i) slower and (ii) takes too much memory.
>> (btw, I'd really need an in-place dot product in numpy to avoid the
>> copy in memory, like the BLAS calls in scipy.linalg. But that's for
>> another thread...)
>>
>> The matrices are square and symmetric. I should be able to get
>> something out of this, but I never found anything related to it in
>> NumPy.
>>
>> I also tried a Cython reimplementation:
>> %timeit scale_and_add_reg(L1,w,Ms)
>> 1 loops, best of 3: 393 ms per loop
>> It brought no gain in speed.
>>
>> Here's the code:
>>
>> @cython.boundscheck(False)
>> def scale_and_add_reg(np.ndarray[float, ndim=2] R,
>>                       np.ndarray[float, ndim=1] w,
>>                       np.ndarray[float, ndim=3] Ms):
>>     return _scale_and_add_reg(R, w, Ms)
>>
>> @cython.boundscheck(False)
>> cdef int _scale_and_add_reg(np.ndarray[float, ndim=2] R,
>>                             np.ndarray[float, ndim=1] w,
>>                             np.ndarray[float, ndim=3] Ms):
>>     cdef unsigned int mdim
>>     cdef int nb_add
>>     (nb_add, mdim, j) = np.shape(Ms)
>>     cdef unsigned int i
>>     for i from 0 <= i < nb_add:
>>         # This line still runs as a NumPy call at the Python level,
>>         # which would explain why Cython brings no speedup here:
>>         R += w[i] * Ms[i]
>>         # The typed inner loops would look like:
>>         #for j in range(mdim):
>>         #    for k in range(mdim):
>>         #        R[j, k] += w[i] * Ms[i, j, k]
>>     return 0
>>
>> So here's my question, if someone has time to answer it: have I tried
>> everything possible? Should I move on and live with this bottleneck?
>> Or is there something else I haven't thought of?
>>
>> Thanks for reading, keep up the great work!
>>
>> -n
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
> Have you tried chunking the vectorized version of the code?  By
> chunking it, you gain the speed-ups of vectorizing, but still stay
> within manageable memory sizes.
>
> You would first pre-allocate your output array.  Then you would create
> an array of indices starting at zero and ending with the size of your
> array.  The values would be spaced so that you can use the index at
> j as the start of a slice and the index at j + 1 as the end of a
> slice.  I typically use 4096 as my increment and allow the last chunk
> to be smaller.
>
> Your loop would loop through these indices, using your vectorized code
> in its body.  The modification is that you will need to specify the
> slices that the vectorized code operates on.
>
> Let us know if that helps!
> Ben Root
>


