[Numpy-discussion] Fast threading solution thoughts

Thu Feb 12 00:46:09 EST 2009

Hi,

This is relevant for anyone who would like to speed up array based
codes using threads.

I have a simple loop that I have implemented using Cython:

def backstep(np.ndarray opti, np.ndarray optf,
             int istart, int iend, double p, double q):
    cdef int j
    cdef double *pi
    cdef double *pf
    pi = <double *>opti.data
    pf = <double *>optf.data

    with nogil:
        for j in range(istart, iend):
            pf[j] = (p*pi[j+1] + q*pi[j])

I need to call this function *many* times and each time cannot be
performed until the previous time is completely as there are data
dependencies.  But, I still want to parallelize a single call to this
function across multiple cores (notice that I am releasing the GIL
before I do the heavy lifting).

I want to break my loop range(istart,iend) into pieces and have a
thread do each piece.  The arrays have sizes 10^3 to 10^5.

Things I have tried:

* Use pthreads and create new threads for each call to my function.
This performed horribly due to the thread creation overhead.
* Use a simple threadpool implementation in Python.  This performed
horribly as well even though I was not recreating threads each call.
The reason in this case was the time spent waiting on locks in the
thread pool implementation (which is based on Queue and threading).
This is either a problem with threading itself or a fundamental
limitation of the pthreads library itself.
* My next step is to implement a thread pool using pthreads/Cython.
Assuming I do this right, this should be as fast as I can get using
pthreads.

The only tools that I know of that are *really* designed to handle
these types of fine-grained parallel problems are:

* Intel's TBB
* OpenMP
* Cilk++
* ???

This seem like pretty heavy solutions though.

So, do people have thoughts about ways of effectively using threads
(not processes) for thin parallel loops over arrays?  This is relevant
to Numpy itself as it would be very nice if all ufuncs could release
the GIL and be run on multiple threads.

Cheers,

Brian