[Numpy-discussion] Quick Question about Optimization

Mon May 19 21:14:58 EDT 2008

Robert Kern wrote:
> On Mon, May 19, 2008 at 6:55 PM, James Snyder <jbsnyder at gmail.com> wrote:
>> Also note, I'm not asking to match MATLAB performance.  It'd be nice,
>> but again I'm just trying to put together decent, fairly efficient
>> numpy code.
> 
> I can cut the time by about a quarter by just using the boolean mask
> directly instead of using where().
> 
>             for n in range(0,time_milliseconds):
>                 u  =  expfac_m  *  prev_u + (1-expfac_m) * aff_input[n,:]
>                 v = u + sigma * stdnormrvs[n, :]
>                 theta = expfac_theta * prev_theta - (1-expfac_theta)
> 
>                 mask = (v >= theta)
> 
>                 S[n,np.squeeze(mask)] = 1
>                 theta[mask] += b
> 
>                 prev_u = u
>                 prev_theta = theta
> 
> 
> There aren't any good line-by-line profiling tools in Python, but you
> can fake it by making a local function for each line:
> 
>             def f1():
>                 u  =  expfac_m  *  prev_u + (1-expfac_m) * aff_input[n,:]
>                 return u
>             def f2():
>                 v = u + sigma * stdnormrvs[n, :]
>                 return v
>             def f3():
>                 theta = expfac_theta * prev_theta - (1-expfac_theta)
>                 return theta
>             def f4():
>                 mask = (v >= theta)
>                 return mask
>             def f5():
>                 S[n,np.squeeze(mask)] = 1
>             def f6():
>                 theta[mask] += b
> 
>             # Run Standard, Unoptimized Model
>             for n in range(0,time_milliseconds):
>                 u = f1()
>                 v = f2()
>                 theta = f3()
>                 mask = f4()
>                 f5()
>                 f6()
> 
>                 prev_u = u
>                 prev_theta = theta
> 
> I get f6() as being the biggest bottleneck, followed by the general
> time spent in the loop (about the same), followed by f5(), f1(), and
> f3() (each about half of f6()), followed by f2() (about half of f5()).
> f4() is negligible.
> 
> Masked operations are inherently slow. They mess up the CPU's branch
> prediction. Worse, the use of iterators in that part of the code
> frustrates the compiler's attempts to optimize them away in the case of
> contiguous arrays.
> 
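
For anyone who wants to try the local-function trick quoted above on something small, here is a rough sketch using the standard cProfile module. The array sizes, the constants and the name run_model are made up for illustration, and only the first two statements of the loop are reproduced; this is not the original model.

```python
import cProfile
import io
import pstats

import numpy as np

# Hypothetical stand-ins for the model's inputs; sizes and values are arbitrary.
rng = np.random.default_rng(0)
n_steps, n_cells = 200, 1000
aff_input = rng.random((n_steps, n_cells))
stdnormrvs = rng.standard_normal((n_steps, n_cells))
expfac_m, sigma = 0.9, 0.1


def run_model():
    prev_u = np.zeros(n_cells)

    # One small function per statement, so the profiler lists each one separately.
    def f1():
        return expfac_m * prev_u + (1 - expfac_m) * aff_input[n, :]

    def f2():
        return u + sigma * stdnormrvs[n, :]

    for n in range(n_steps):
        u = f1()
        v = f2()
        prev_u = u
    return v


profiler = cProfile.Profile()
profiler.enable()
v = run_model()
profiler.disable()

# Collect the report in a string so it can be printed or inspected.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats()
report = stream.getvalue()
print(report)
```

The rows named f1 and f2 in the report give the per-statement cost. The function-call overhead is included in each row, so the numbers are only good for ranking the statements against each other.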

f6 can be sped up by more than a factor of 2 by using putmask:

In [10]: xx = np.random.rand(100000)

In [11]: mask = xx > 0.5

In [12]: timeit xx[mask] += 2.34
100 loops, best of 3: 4.06 ms per loop

In [14]: timeit np.putmask(xx, mask, xx+2.34)
1000 loops, best of 3: 1.4 ms per loop
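
The same comparison can be run as a plain script with the standard timeit module. This is only a sketch: the helper names add_with_index and add_with_putmask are mine, and the timings will depend on the machine and the numpy version, so none are quoted here. It also checks that the two forms give the same array.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
base = rng.random(100000)
mask = base > 0.5


def add_with_index(arr):
    # Boolean-mask indexing with an in-place add (what f6 does).
    arr[mask] += 2.34
    return arr


def add_with_putmask(arr):
    # putmask copies values from arr + 2.34 only where mask is True.
    np.putmask(arr, mask, arr + 2.34)
    return arr


# Both forms should produce the same array from the same input.
a = add_with_index(base.copy())
b = add_with_putmask(base.copy())
same_result = np.allclose(a, b)

# Time each form on its own scratch copy.
xx = base.copy()
t_index = timeit.timeit(lambda: add_with_index(xx), number=100)
yy = base.copy()
t_putmask = timeit.timeit(lambda: add_with_putmask(yy), number=100)

print("same result:", same_result)
print("indexing: %.3f ms per call" % (t_index / 100 * 1e3))
print("putmask:  %.3f ms per call" % (t_putmask / 100 * 1e3))
```

Note that the putmask form builds the full temporary xx+2.34 before copying the selected elements.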

I think that
	xx += 2.34*mask
will be similarly quick, but I can't get IPython's timeit to work with it.
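
One way around that, as a sketch, is to wrap the statement in a function and hand it to the standard timeit module. The name add_with_multiply is made up here, and no timings are claimed. The multiply form works because the boolean mask is promoted to 0.0 and 1.0, so unmasked elements have 0.0 added to them.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
base = rng.random(100000)
mask = base > 0.5


def add_with_multiply(arr):
    # 2.34 * mask is a float array holding 2.34 where mask is True, else 0.0.
    arr += 2.34 * mask
    return arr


# Reference result from boolean-mask indexing.
expected = base.copy()
expected[mask] += 2.34

result = add_with_multiply(base.copy())
same_result = np.allclose(result, expected)

# Time the multiply form on a scratch copy.
xx = base.copy()
t_multiply = timeit.timeit(lambda: add_with_multiply(xx), number=100)

print("same result:", same_result)
print("multiply: %.3f ms per call" % (t_multiply / 100 * 1e3))
```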

Eric