[SciPy-User] Re: Seeking help/advice for applying functions (transferred from Scipy-Dev)

Anne Archibald peridot.faceted at gmail.com
Tue Mar 9 15:11:55 EST 2010


On 9 March 2010 14:51, eat <e.antero.tammi at gmail.com> wrote:
>>
>> On 9 March 2010 09:59, eat <e.antero.tammi <at> gmail.com> wrote:
>> > Robert Kern <robert.kern <at> gmail.com> writes:
>> >
>> >>
>> >> Your example is not very clear. Can you write a less cryptic one with
>> >> informative variable names and perhaps some comments about what each
>> >> part is doing?
>> >>
>> >
> Anne Archibald <peridot.faceted <at> gmail.com> writes:
>
> Hi,
>
> I moved this thread from scipy-dev to scipy-user.
>
> First of all, thanks Anne, you clarified a lot.
>
>
>> > """
>> > Hi,
>> >
>> > I have tried to clarify my code. The first part is the relevant one; the
>> > rest is there just to provide some context to run some tests.
>>
>> The short answer is that no, there's no way to optimize what you're doing.
>>
>> The long answer is: when numpy and scipy are fast, they are fast
>> because they avoid running python code: if you add two arrays, there's
>> only one line of python code, and all the work is done by loops
>> written in C. If your code is calling many different python functions,
>> well, since they're python functions, to apply them at all you must
>> necessarily execute python code. There goes any potential speed
>> advantage. (There may be a convenience advantage; if so, you can look
>> into using np.vectorize, which is just a wrapper around a python loop,
>> but is convenient.)
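
(To be concrete about np.vectorize, here is a minimal sketch, with a
made-up scalar function f standing in for one of yours:

import numpy as np

def f(x):
    return x * x + 1.0   # any scalar python function

vf = np.vectorize(f)     # convenience wrapper; still a python loop inside
r = vf(np.arange(5))     # -> array([ 1.,  2.,  5., 10., 17.])

It saves you writing the loop, but each element still goes through a
python-level call, so there is no speed advantage.)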
>
> Not only am I new to Scipy/Numpy, I'm new to python as well.
> I think I assumed that once python functions get compiled they would be
> (almost) as efficient as builtins.

Not especially. They are compiled to bytecode, whose execution is not
particularly fast. But the big problem is all the baggage of python's
nature as a dynamic language: for example each value is allocated with
malloc() and contains type information; for another example, each
access to a list involves identifying that the object really is a
list, dispatching to the list-lookup function, determining the type of
the argument (integer, slice object, long integer, other), and bounds
checking before finally returning the list element. Thus even tools
like cython that let you write, effectively, python code that gets
compiled to machine code are not much faster unless you can turn off
the dynamic features of python (which cython lets you do, selectively;
it's great).
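
To get a feeling for the gap, you can time an explicit python loop
against the equivalent numpy operation (a rough sketch; the exact ratio
is machine-dependent):

import timeit
import numpy as np

def py_sum(xs):
    # bytecode dispatch and boxed floats on every iteration
    total = 0.0
    for v in xs:
        total += v
    return total

xs = [float(i) for i in range(10**6)]
x = np.arange(10**6, dtype=float)

t_py = timeit.timeit(lambda: py_sum(xs), number=10)
t_np = timeit.timeit(lambda: x.sum(), number=10)
print(t_py / t_np)   # typically one to two orders of magnitude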

>>
>> That said, I assume you are considering numpy/scipy because you have
>> arrays of thousands or more. It also seems unlikely that you actually
>> have thousands of different functions (that's an awful lot of source
>> code!). So if your "different" functions are actually just a handful
>> (or fewer) pieces of actual code, and you are getting your thousands
>> of functions by wrapping them up with parameters and local variables,
>> well, now there are possibilities. Exactly what possibilities depend
>> on what your functions look like - which is one reason Robert Kern
>> asked you to clarify your code - but they all boil down to rearranging
>> the problem so that it goes back to "few functions, much data", then
>> writing the functions in such a way that you can use numpy to apply
>> them to thousands or millions of data points at once.
>
> Yes, indeed I don't have so many different 'base' functions, just
> configured (parametrized) in different ways. However, there are situations
> when it's an advantage to be able to treat your functions as 'black boxes',
> for example when the parametrization is based on optimization.

Think about whether you can write each 'base' function to take arrays
as arguments, for example:

import numpy as np

def F(x, mu, sigma):
    return np.exp(-((x - mu) / sigma)**2)

If your current code does something like

fis = [lambda x, mu=mui, sigma=sigmai: F(x, mu, sigma)  # defaults bind each pair now;
       for (mui, sigmai) in zip(muis, sigmais)]         # a bare closure would see only the last

r = [f(7) for f in fis]

you can rewrite it as the single line

r = F(7, muis, sigmais)

(if muis and sigmais are numpy arrays). Now you have just a couple of
lines of python, and the heavy lifting all happens inside numpy loops.
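
Broadcasting extends the same idea if you also have many evaluation
points; a sketch, assuming muis and sigmais are 1-d arrays:

xs = np.linspace(-1.0, 1.0, 101)
# shape (len(muis), len(xs)): every parametrized F at every x
r2 = F(xs[np.newaxis, :], muis[:, np.newaxis], sigmais[:, np.newaxis])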

If you have several different functions, look into separating your
input arrays based on which function needs to be applied to them;
remember numpy lets you select out all the elements of an array
meeting a criterion.
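
With, say, just two pieces of code f0 and f1 and a criterion deciding
which applies where, that could look like this (a sketch; f0, f1 and
the criterion are made up):

import numpy as np

def f0(x):
    return np.sin(x)           # stand-ins for your real 'base' functions

def f1(x):
    return np.exp(-x**2)

x = np.linspace(-2.0, 2.0, 9)
use_f0 = x < 0.0               # boolean mask: which function applies where

out = np.empty_like(x)
out[use_f0] = f0(x[use_f0])    # f0 applied only to its elements
out[~use_f0] = f1(x[~use_f0])  # f1 to the rest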

I realize this kind of rewriting will mess up a nice clean functional
design, but as is often the case, that is the price you pay for
performance. If your code works and is fast enough, I recommend
leaving it as is.

Anne


