[SciPy-User] Vectorizing scipy.optimize.curve_fit

josef.pktd at gmail.com
Sun Feb 3 07:28:14 EST 2013


On Sat, Feb 2, 2013 at 9:38 PM, Ryan Nelson <rnelsonchem at gmail.com> wrote:
> Hi Paul,
>
> I've played around with this for scipy.optimize.leastsq, and I've never been
> successful. I'm not an expert, but I don't think the original code was set
> up for that type of usage. (Just speculating here... An equally likely
> explanation is that I'm not smart enough to figure it out :)
>
> What I've found really useful for this type of problem is IPython's new
> parallel architecture (IPython version 0.13.1). It's easy to set up an
> IPython cluster and get things running in parallel through the notebook
> interface, so I've attached a simple notebook that does a trivial fitting
> (using leastsq) of some noisy data. Before you run the notebook, you'll need
> to set up a cluster through the Cluster tab in the IPython notebook
> dashboard.
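>
> Roughly, the notebook does something like this (a bare sketch of the
> idea, not the attached notebook itself; fit_one, the linear model, and
> model_runs are placeholder names):
>
> from IPython.parallel import Client
>
> def fit_one(layer):
>     # imports live inside so the function is self-contained when it is
>     # shipped to each engine
>     from scipy.optimize import leastsq
>     def residuals(p, x, y):
>         return y - (p[0] * x + p[1])   # placeholder linear model
>     popt, ier = leastsq(residuals, [1.0, 0.0],
>                         args=(layer[:, 0], layer[:, 1]))
>     return popt
>
> rc = Client()                          # connect to the running cluster
> lview = rc.load_balanced_view()
> results = lview.map_sync(fit_one, list(model_runs))  # one task per run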
>
> Unfortunately, in my experience the speed improvement only scales until
> the number of IPython engines in your cluster equals the number of
> physical cores, not the number of processor threads. (This might be
> obvious for those in the know, but it took me a while to figure out.) But
> every little bit of improvement helps, I suppose.
>
> Ryan
>
>
> On 2/1/2013 6:07 PM, Paul Hobson wrote:
>
> Hey folks,
>
> I've run into a bit of a roadblock. I've got some model runs (x) in an Nx2
> array where the first column is the input, and the second column is the
> output. So in a single case, I'd do:
>
> popt, pcov = scipy.optimize.curve_fit(myFit, x[:,0], x[:,1])
>
> But how should I handle, say, 5000 model runs such that x.shape = (5000, N,
> 2) and I want the 5000 results for popt?
>
> This works:
> popt_array = np.empty((5000, 2))
> for r, layer in enumerate(model_runs):
>     popt, pcov = scipy.optimize.curve_fit(myFit, layer[:,0], layer[:,1])
>     popt_array[r] = popt
>
> But is there a better (faster) way? The number of model runs and data points
> may grow substantially (~10^4 runs and ~10^3 data points).

I think there is no efficient way to avoid the loop.

besides going parallel:

If you are only interested in popt, then using optimize.leastsq
directly will save some calculations.
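
A minimal sketch (reusing Paul's myFit and one (N, 2) layer from his
loop; the starting values are placeholders):

from scipy.optimize import leastsq

def residuals(params, x, y):
    # curve_fit minimizes the sum of squares of exactly this vector,
    # but it also does input checking and estimates pcov, which we skip
    return y - myFit(x, *params)

p0 = [1.0, 1.0]     # starting values, assuming myFit has two parameters
popt, ier = leastsq(residuals, p0, args=(layer[:, 0], layer[:, 1]))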

The other major speedup for this kind of problem is to find good
starting values. For example, if the solutions in the sequence are
close to each other, then using previous solutions as starting values
for the next case will speed up convergence.
IIRC, in statsmodels we use an average of previous solutions, after
some initial warm-up, in a similar problem. A moving average could
also be useful.
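
A rough sketch of the warm start, reusing the residuals function above
(the warm-up length of 10 is an arbitrary choice):

import numpy as np
from scipy.optimize import leastsq

popt_array = np.empty((5000, 2))
p0 = [1.0, 1.0]                      # guess for the very first run
for r, layer in enumerate(model_runs):
    popt, ier = leastsq(residuals, p0, args=(layer[:, 0], layer[:, 1]))
    popt_array[r] = popt
    # after a short warm-up, start the next fit from the average of all
    # solutions so far; a moving average would adapt to drifting solutions
    p0 = popt_array[:r + 1].mean(axis=0) if r > 10 else popt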

Josef

>
> Thanks,
> -paul
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>


