[SciPy-User] Dense 3d array as base, sparse 3d as exponent - how to compute do efficiently?

David Koch ogdude at googlemail.com
Thu Nov 25 08:35:43 EST 2010


 Hello,

I am trying to compute the the following as fast as possible:

seq_lik = scipy.prod(scipy.prod(T ** C, 1), 1)

where

 T is a m * m 2d array - a transition matrix where each row sums to 1.
C is a n_seq * m * m 3d array of type int - a "transition count" array with
less than 5% nnz. Some C[i,:,:] contain only zeros.

To sum it up, the operation calculates the likelihood seq_lik[i] of a
sequence "i" among n_seq sequences by taking the product of elements in T **
C[i,:,:].

Given, that C is very sparse I figured it is a waste to evaluate for all the
0 expontents. My program spends about 95% of its time at this line.

Scipy.sparse came to mind but there are no 3d sparse arrays and **/pow have
not been implemented. The best I could come up with for now was a solution
where I reshaped the 3d array into a 2d, used scipy.sparse and only
operated on non-zero elements. See here: http://codepad.org/AiAZoioo (also
see script attached to bottom of this post). This gives a speed-up of about
6 times compared to dense, where I assumed 5% nnz. However, if I check
what's been reported in a decade-old paper, a lot more should be possible on
a personal computer.

Question 1:
When it comes to performance, is there something to be gained by
implementing the discussed operation in a C/Cython extension?

Question 2:
Somewhat related, I sometimes get the impression that with SciPy I spend
more time figuring out how I can make clever use of ndarray manipulation
techniques such as dot and tensordot to avoid for-loops than it may (!) take
to just write down all the loops and compile them into a C extension.
Bearing in mind that not all ndarray manipulation functionality is
available/makes sense for sparse matrices, what do you recommend: step up
lin-alg mojo and use SciPy or write an extension? (... granted, there's
potenital for a different kind of head-ache in that).

Thank you in advance, any clues are appreciated - otherwise I try my luck on
stackoverflow ;-)

/David

Script (just needs scipy):
-------------------------
import scipy
import scipy.sparse
import time

if __name__ == '__main__':

    zero_frac = 0.95

    T = scipy.rand(400) # does not sum to one but whatever
    C = (scipy.rand(100000,400) > zero_frac).astype(scipy.int16)

    C_sparse = scipy.sparse.coo_matrix(C)
    sp_coords = C_sparse.nonzero()
    sp_data = C_sparse.data

    start_time = time.time()
    lik_dense = scipy.prod(T ** C, 1)
    print('Time dense: %f,' % (time.time() - start_time))

    start_time = time.time()
    log_results = scipy.log(T[sp_coords[1]] ** sp_data);
    lik_sparse = scipy.exp(scipy.sparse.csr_matrix((log_results, sp_coords),
shape=C_sparse.shape).sum(1))
    print('Time sparse: %f,' % (time.time() - start_time))

    print lik_dense[:10]
    print lik_sparse[:10]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20101125/28201990/attachment.html>


More information about the SciPy-User mailing list