[Numpy-discussion] Quick Question about Optimization
Eric Firing
efiring at hawaii.edu
Mon May 19 21:14:58 EDT 2008
Robert Kern wrote:
> On Mon, May 19, 2008 at 6:55 PM, James Snyder <jbsnyder at gmail.com> wrote:
>> Also note, I'm not asking to match MATLAB performance. It'd be nice,
>> but again I'm just trying to put together decent, fairly efficient
>> numpy code.
>
> I can cut the time by about a quarter by just using the boolean mask
> directly instead of using where().
>
> for n in range(0,time_milliseconds):
> u = expfac_m * prev_u + (1-expfac_m) * aff_input[n,:]
> v = u + sigma * stdnormrvs[n, :]
> theta = expfac_theta * prev_theta - (1-expfac_theta)
>
> mask = (v >= theta)
>
> S[n,np.squeeze(mask)] = 1
> theta[mask] += b
>
> prev_u = u
> prev_theta = theta
>
>
> There aren't any good line-by-line profiling tools in Python, but you
> can fake it by making a local function for each line:
>
> def f1():
> u = expfac_m * prev_u + (1-expfac_m) * aff_input[n,:]
> return u
> def f2():
> v = u + sigma * stdnormrvs[n, :]
> return v
> def f3():
> theta = expfac_theta * prev_theta - (1-expfac_theta)
> return theta
> def f4():
> mask = (v >= theta)
> return mask
> def f5():
> S[n,np.squeeze(mask)] = 1
> def f6():
> theta[mask] += b
>
> # Run Standard, Unoptimized Model
> for n in range(0,time_milliseconds):
> u = f1()
> v = f2()
> theta = f3()
> mask = f4()
> f5()
> f6()
>
> prev_u = u
> prev_theta = theta
>
> I get f6() as being the biggest bottleneck, followed by the general
> time spent in the loop (about the same), followed by f5(), f1(), and
> f3() (each about half of f6()), followed by f2() (about half of f5()).
> f4() is negligible.
>
> Masked operations are inherently slow. They mess up CPU's branch
> prediction. Worse, the use of iterators in that part of the code
> frustrates compilers' attempts to optimize that away in the case of
> contiguous arrays.
>
f6 can be sped up more than a factor of 2 by using putmask:
In [10]:xx = np.random.rand(100000)
In [11]:mask = xx > 0.5
In [12]:timeit xx[mask] += 2.34
100 loops, best of 3: 4.06 ms per loop
In [14]:timeit np.putmask(xx, mask, xx+2.34)
1000 loops, best of 3: 1.4 ms per loop
I think that
xx += 2.34*mask
will be similarly quick, but I can't get ipython timeit to work with it.
Eric
More information about the NumPy-Discussion
mailing list