[Numpy-discussion] Quick Question about Optimization

Mon May 19 23:26:26 EDT 2008

I've done a little profiling with cProfile as well as with dtrace,
since the bindings exist on Mac OS X and a lot of the D scripts
apply to Python.  Previously I'd found that the np.random call and
the where() (in the original code) were the heavy hitters in terms
of time consumed.
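
For reference, a minimal sketch of the kind of cProfile run described
above.  The simulate() function here is a hypothetical stand-in with
assumed sizes, not the original model code; it only exists so that the
profiler has something containing a random draw and a where() to
measure.

```python
import cProfile
import io
import pstats

import numpy as np


def simulate(n_steps=200, n_cells=1000, seed=0):
    """Hypothetical stand-in for the model loop; not the original code."""
    rng = np.random.RandomState(seed)
    spikes = np.zeros((n_steps, n_cells))
    for n in range(n_steps):
        v = rng.standard_normal(n_cells)   # random draw on every step
        idx = np.where(v >= 1.0)[0]        # index array built with where()
        spikes[n, idx] = 1
    return spikes


# Profile a single call and collect the report as a string.
profiler = cProfile.Profile()
profiler.enable()
result = simulate()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)   # ten most expensive entries
report = stream.getvalue()
print(report)
```

Sorting by cumulative time shows which calls dominate the loop as a
whole; sorting by "tottime" instead would rank calls by the time spent
inside each one alone.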

With this suggestion and Eric Firing's included, the run time is now
down to ~9 seconds from the original 13-14 s.  This is without
scipy.weave, which at the moment I can't get to work for all lines;
when I successfully replace just one of them it seems to run more
slowly, I assume because it is converting data back and forth.

Quick question regarding the pointer abstraction that's going on.  The
following seems to work:
np.putmask(S[n,:],np.squeeze(mask),1)

with that section of S being modified in place.  Is it safe to assume
in most cases when working with NumPy that, with no operations other
than indexing, a reference rather than a copy is being passed?  That
certainly seems to be the case for this sort of thing, including stuff
like:

        u = self.u
        v = self.v
        theta = self.theta
        ...

Not having to repack those data into self later, since u, v and theta
are just references to the existing data, saves on code and whatnot,
but I'm a little worried about not being explicit.
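
To make the worry concrete, here is a small sketch of where the
"local name is just a reference" assumption holds and where it stops
holding.  The Model class and its single attribute are hypothetical
stand-ins for the real object.

```python
import numpy as np


class Model:
    """Hypothetical container standing in for the object that owns u."""

    def __init__(self):
        self.u = np.zeros(3)


m = Model()

u = m.u            # another name for the same array, no copy is made
u += 1.0           # in-place update: m.u changes as well
same_after_inplace = u is m.u

u = u * 2.0        # builds a new array and rebinds the local name only
same_after_rebind = u is m.u

u2 = m.u
u2[:] = u2 * 2.0   # slice assignment writes into the existing array
```

In-place operators and slice assignment keep the local name and the
attribute pointing at the same data.  A plain "u = ..." does not, so a
result computed that way has to be stored back into self explicitly.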

Are there any major pitfalls to be aware of?  It sounds like if I do:
f = a[n,:] I get a reference, but if I did something like g = a[n,:]*2
I would get a copy.
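
A short sketch that checks this with np.shares_memory, using a small
made-up array.  It also covers one case not mentioned above: indexing
with a boolean mask returns a copy when reading, even though assigning
through the same mask writes into the original array.

```python
import numpy as np

a = np.arange(12, dtype=float).reshape(3, 4)
n = 1

f = a[n, :]                 # basic indexing: a view onto row n
g = a[n, :] * 2             # arithmetic: a new array
mask = a[n, :] > 5          # boolean mask for row n
h = a[n, mask]              # boolean indexing on read: a copy

f_is_view = np.shares_memory(f, a)
g_is_view = np.shares_memory(g, a)
h_is_view = np.shares_memory(h, a)

h[:] = -1                   # changes the copy only, a is untouched
a[n, mask] = 0              # assignment through a mask writes into a
np.putmask(a[n, :], a[n, :] == 0, 99)   # works because a[n, :] is a view
```

The putmask call is the same pattern as np.putmask(S[n,:], ...): it
modifies the original array only because its first argument is a view
produced by basic indexing.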

Thanks guys.  This is definitely useful; in combination with PyPar on
my dual-core system I'm getting pretty good performance :-)

If anyone has any clues on why scipy.weave is blowing up
(http://pastebin.com/m79699c04) using the code I attached, I wouldn't
mind knowing.  Most of the times I've attempted using weave, I've been
a little baffled about why things aren't working.  I also don't have a
sense for whether all numpy functions should be "weavable" or if it's
just general array operations that can be put through weave.

I know this is the numpy list, so I can take things over to the scipy
list if that's more appropriate.

On Mon, May 19, 2008 at 7:36 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Mon, May 19, 2008 at 6:55 PM, James Snyder <jbsnyder at gmail.com> wrote:
>> Also note, I'm not asking to match MATLAB performance.  It'd be nice,
>> but again I'm just trying to put together decent, fairly efficient
>> numpy code.
>
> I can cut the time by about a quarter by just using the boolean mask
> directly instead of using where().
>
>            for n in range(0,time_milliseconds):
>                u  =  expfac_m  *  prev_u + (1-expfac_m) * aff_input[n,:]
>                v = u + sigma * stdnormrvs[n, :]
>                theta = expfac_theta * prev_theta - (1-expfac_theta)
>
>                mask = (v >= theta)
>
>                S[n,np.squeeze(mask)] = 1
>                theta[mask] += b
>
>                prev_u = u
>                prev_theta = theta
>
>
> There aren't any good line-by-line profiling tools in Python, but you
> can fake it by making a local function for each line:
>
>            def f1():
>                u  =  expfac_m  *  prev_u + (1-expfac_m) * aff_input[n,:]
>                return u
>            def f2():
>                v = u + sigma * stdnormrvs[n, :]
>                return v
>            def f3():
>                theta = expfac_theta * prev_theta - (1-expfac_theta)
>                return theta
>            def f4():
>                mask = (v >= theta)
>                return mask
>            def f5():
>                S[n,np.squeeze(mask)] = 1
>            def f6():
>                theta[mask] += b
>
>            # Run Standard, Unoptimized Model
>            for n in range(0,time_milliseconds):
>                u = f1()
>                v = f2()
>                theta = f3()
>                mask = f4()
>                f5()
>                f6()
>
>                prev_u = u
>                prev_theta = theta
>
> I get f6() as being the biggest bottleneck, followed by the general
> time spent in the loop (about the same), followed by f5(), f1(), and
> f3() (each about half of f6()), followed by f2() (about half of f5()).
> f4() is negligible.
>
> Masked operations are inherently slow. They mess up the CPU's branch
> prediction. Worse, the use of iterators in that part of the code
> frustrates compilers' attempts to optimize that away in the case of
> contiguous arrays.
>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>  -- Umberto Eco
>
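
For anyone who wants to try the quoted loop, here is a self-contained
sketch that runs it both with where() and with the boolean mask used
directly, and lets the results be compared.  The array sizes, constants
and initial values are assumptions picked for illustration, and each
time step is treated as a 1-D array, so no squeeze() is needed; none of
these are the values from the original model.

```python
import numpy as np

# Assumed sizes and constants, chosen only for illustration.
n_steps, n_cells = 50, 200
expfac_m, expfac_theta, sigma, b = 0.9, 0.95, 0.1, 0.5

rng = np.random.RandomState(0)
aff_input = rng.rand(n_steps, n_cells)
stdnormrvs = rng.standard_normal((n_steps, n_cells))


def run(use_where):
    """Run the loop, marking threshold crossings in S."""
    S = np.zeros((n_steps, n_cells))
    prev_u = np.zeros(n_cells)
    prev_theta = np.ones(n_cells)
    for n in range(n_steps):
        u = expfac_m * prev_u + (1 - expfac_m) * aff_input[n, :]
        v = u + sigma * stdnormrvs[n, :]
        theta = expfac_theta * prev_theta - (1 - expfac_theta)
        mask = v >= theta
        if use_where:
            idx = np.where(mask)[0]   # extra pass to build an index array
            S[n, idx] = 1
            theta[idx] += b
        else:
            S[n, mask] = 1            # boolean mask used directly
            theta[mask] += b
        prev_u, prev_theta = u, theta
    return S


S_where = run(use_where=True)
S_mask = run(use_where=False)
print(np.array_equal(S_where, S_mask))
```

Wrapping each call in timeit.timeit(lambda: run(True), number=20) and
the same for run(False) gives a rough timing comparison; how large the
difference is will depend on the array sizes and the NumPy version.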

-- 
James Snyder
Biomedical Engineering
Northwestern University
jbsnyder at gmail.com