[SciPy-User] BFGS precision vs least_sq

Thu Oct 16 17:52:33 EDT 2014

On 16 October 2014 22:59, Matt Newville <newville at cars.uchicago.edu> wrote:

> On Wed, Oct 15, 2014 at 7:25 PM, Andrew Nelson <andyfaff at gmail.com> wrote:
>
>> Dear scipy users,
>> I am using BFGS as a final 'polishing' minimizer after using another
>> minimization technique.  I thought I would check how it performs against a
>> known non-linear regression standard,
>> http://www.itl.nist.gov/div898/strd/nls/data/thurber.shtml.
>>
>> I discovered that BFGS doesn't get to the absolute known minimum for this
>> least squares problem, even if the starting point is almost the best
>> solution.  Rather, it finishes with a chi2 of 5648.6 against the lowest
>> chi2 minimum of 5642.7 (using scipy.optimize.leastsq).  The problem seems
>> to be 'precision loss', as reported by fmin_bfgs.  I have tried adjusting
>> the gtol and epsilon options for fmin_bfgs, but did not do any better.
>> Ok, so the two chi2 are only out by 0.1%, but I would like to know why
>> BFGS can't get to the same minimum as leastsq.
>>
>>
> You mentioned the result for chi-square but not for the parameters
> themselves.  Are those close to the certified values?
>
> Your observation is consistent with my experience with the NIST standard
> reference data.  The tests are hard (and Thurber is listed as Higher level
> of difficulty), and leastsq() does much better than the scalar optimization
> methods from scipy both in terms of final results and number of function
> evaluations.
>
> I suspect that because leastsq() examines the full Jacboian (that is, has
> a residual array) instead of a single scalar value, it can better navigate
> the parameter space.   Leastsq() mixes Gauss-Netwon and steepest descent,
> but I don't know why that would give an improvement in ultimate chi-square
> for the Thurber problem over Newton's method (which I understand BFGS to
> be).
>
> I also suspect that the NIST "certified results" are derived from analysis
> with MINPACK-1 (ie, leastsq).   I don't know this for sure, but the tests
> are old enough (mid to late 1990s, perhaps older) that using MINPACK for
> the analysis would be reasonable.  Furthermore, the results from leastsq
> are very, very good.  For the Thurber data, the 37 data points have 5 to 7
> digits for x, and 4 digits for y.  The certified parameters and their
> uncertainties,  and the residual sum of squares are given to 10 digits  --
> the Readme states that not all of these values are statistically
> significant, and that 3 digits are hard to get for some problems.  A simple
> use of leastsq() for the Thurber data with no fussing with tolerances or
> scaling gives at least 4 correct digits for each parameter, at least 3
> digits for the parameter uncertainties and 8 digits for the residual sum of
> squares, from either of the 2 supplied starting values.    Similarly
> excellent results are found for most of the NIST tests with leastsq() -- it
> often does as well as NIST says to expect.   The certified values had to
> come from somewhere... why not from a non-linear least-squares using
> MINPACK?   That's just to say that comparing any method to leastsq on the
> NIST data may be a biased test in favor of leastsq.
>
> OTOH, BFGS fails to alter the initial values from either starting point
> for the Thurber data for me.  So the fact that you're finding 3 digits for
> residual sum of squares is a huge improvement!
>
> Whether the Thurber test is a good test for the polishing step of a global
> optimizer may be another issue.  For a global optimizer you probably *want*
> a scaler minimizer  -- so that maximum likelihood can be used instead of
> least-squares, for example.  That is, using leastsq() for the polishing
> stage may not be suitable.    And if the idea of global optimization is to
> locate good starting points for local minimization, then it shouldn't
> matter too much what local optimizer you use.
>
>
>
The starting parameters in the demo script I wrote are from a
differential_evolution minimisation followed by a BFGS 'polish'.  So in
that respect I'm taking the endpoint of a BFGS run and trying to get it
even lower by running BFGS again.  I then wanted to see if leastsq got any
lower.  It's a bit unfair, the output vector from BFGS is very close to the
certified values.
I would've thought that both should be equally good at calculating the
gradient/Jacobian (first derivative matrix), that's simply a vector of
\frac{\partial \chi^2}{\partial x_i}. I understand that the way that they
calculate the Hessians are different.

The reason I am trying to get to the bottom of this is because I am working
on the benchmark suite for (scalar) global optimizers. Each problem should
have a way of judging if the global minimum has been reached.  The easy way
is to just set an absolute tolerance for differences from the known minimum
energy e.g. 1e-4.  However, that won't be fair in problems such as these -
the BFGS is very close to the minimum energy (within 0.1%), it just can't
polish to the absolute minimum like leastsq can (at least from this
starting point).  But is that a reason to mark it as a fail? So for some
problems in the benchmarking perhaps there needs to be an rtol and atol.
How does one set those rtol/atol?
The polishing in differential_evolution is done by L-BFGS-S, it can't use
leastsq for the reason you mentioned.  But for the least squares problems
the leastsq can be run independently.

I went through the code for bfgs and got to line_search_wolfe (or something
along those lines), to see where the precision warning message came from.
There my code reading head gave out.  I was wondering if there were any
tolerances/epsilons I could set tighter to get that final drop of reduction
in energy, but I didn't get any further.  Perhaps the list can explain the
situation in which the precision message is emitted?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20141017/8eca462d/attachment.html>