[Numpy-discussion] Quick Question about Optimization

Mon May 19 20:36:54 EDT 2008

On Mon, May 19, 2008 at 6:55 PM, James Snyder <jbsnyder at gmail.com> wrote:
> Also note, I'm not asking to match MATLAB performance.  It'd be nice,
> but again I'm just trying to put together decent, fairly efficient
> numpy code.

I can cut the time by about a quarter by just using the boolean mask
directly instead of using where().

            for n in range(0,time_milliseconds):
                u = expfac_m * prev_u + (1-expfac_m) * aff_input[n,:]
                v = u + sigma * stdnormrvs[n, :]
                theta = expfac_theta * prev_theta - (1-expfac_theta)

                mask = (v >= theta)

                S[n,np.squeeze(mask)] = 1
                theta[mask] += b

                prev_u = u
                prev_theta = theta
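
For illustration, here is a small standalone comparison of the two
indexing styles. It is only a sketch: the array size, the value of b,
and the exact form of the where() call are my assumptions, not taken
from the original script. It checks that indexing with the boolean
mask gives the same result as going through where() first.

```python
import numpy as np

# Toy data; the size and the value of b are assumptions for this sketch.
rng = np.random.default_rng(0)
v = rng.standard_normal(1000)
theta = np.zeros(1000)
b = 0.5

mask = v >= theta

# where() version: builds an integer index array before indexing.
theta_where = theta.copy()
idx = np.where(mask)
theta_where[idx] += b

# Boolean-mask version: index with the mask directly.
theta_mask = theta.copy()
theta_mask[mask] += b

same = np.array_equal(theta_where, theta_mask)
print(same)
```

The two forms do the same update; the mask version just skips
building the intermediate index array.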

There aren't any good line-by-line profiling tools in Python, but you
can fake it by making a local function for each line:

            def f1():
                u = expfac_m * prev_u + (1-expfac_m) * aff_input[n,:]
                return u
            def f2():
                v = u + sigma * stdnormrvs[n, :]
                return v
            def f3():
                theta = expfac_theta * prev_theta - (1-expfac_theta)
                return theta
            def f4():
                mask = (v >= theta)
                return mask
            def f5():
                S[n,np.squeeze(mask)] = 1
            def f6():
                theta[mask] += b

            # Run Standard, Unoptimized Model
            for n in range(0,time_milliseconds):
                u = f1()
                v = f2()
                theta = f3()
                mask = f4()
                f5()
                f6()

                prev_u = u
                prev_theta = theta

I get f6() as the biggest bottleneck, followed by the general
overhead of the loop itself (about the same as f6()), followed by
f5(), f1(), and f3() (each about half of f6()), followed by f2()
(about half of f5()). f4() is negligible.
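
One way to collect per-function times like these is sketched below.
The timed() helper, the toy sizes, and the constants are all
hypothetical and exist only to make the example runnable. It wraps
each call with time.perf_counter() and accumulates the elapsed time
per function name. Running the loop under a profiler would be another
option.

```python
import time
from collections import defaultdict

import numpy as np


def timed(name, func, totals):
    """Call func(), add its wall-clock time to totals[name], return its result."""
    start = time.perf_counter()
    result = func()
    totals[name] += time.perf_counter() - start
    return result


# Hypothetical toy sizes and constants, not the values from the real model.
steps, cells = 200, 1000
rng = np.random.default_rng(0)
aff_input = rng.random((steps, cells))
stdnormrvs = rng.standard_normal((steps, cells))
expfac_m, sigma = 0.9, 0.1
prev_u = np.zeros(cells)


def f1():
    return expfac_m * prev_u + (1 - expfac_m) * aff_input[n, :]


def f2():
    return u + sigma * stdnormrvs[n, :]


totals = defaultdict(float)
for n in range(steps):
    u = timed("f1", f1, totals)
    v = timed("f2", f2, totals)
    prev_u = u

# Report the slowest step first.
for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds:.6f} s")
```

The wrapper adds its own call overhead to every step, so the numbers
are only good for comparing steps against each other.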

Masked operations are inherently slow. They mess up the CPU's branch
prediction. Worse, the use of iterators in that part of the code
frustrates compilers' attempts to optimize the overhead away in the
case of contiguous arrays.
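
A possible way around the masked in-place add, which I have not timed
against the real model, is to do the update arithmetically: multiply
the mask by b and add the result to the whole array. The sketch below
uses made-up data and only checks that the two forms agree. Whether
the arithmetic form is actually faster depends on the array size and
on how many elements the mask selects, so measure before switching.

```python
import numpy as np

# Made-up data; sizes and the value of b are assumptions for this sketch.
rng = np.random.default_rng(0)
theta = rng.random(1000)
v = rng.random(1000)
b = 0.5
mask = v >= theta

# Masked in-place add, as in the loop above.
masked = theta.copy()
masked[mask] += b

# Arithmetic form: True counts as 1 and False as 0, so b * mask is
# b where the mask is set and 0 elsewhere.
arith = theta.copy()
arith += b * mask

same = np.array_equal(masked, arith)
print(same)
```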

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco