[SciPy-Dev] New Tutorial on Optimize Help

Matt Newville newville at cars.uchicago.edu
Mon Oct 7 00:21:47 EDT 2019


Hi Christina,

On Mon, Sep 30, 2019 at 6:43 PM Christina Lee <chrissie.c.l at gmail.com>
wrote:

> Hi,
>
> I'm a SciPy technical writer and am currently rewriting the scipy.optimize
> tutorial, focusing on `minimize` right now.  While I've gotten a grasp of
> the "how", I still want to explain "why". Why choose one option over
> another? I could use information from those with more experience.
>

Thank you very much for taking this on -- this is heroic work.  I would
make no claims about what "most users" use these methods for, and doubt
anyone else really knows this either.   In my experience (maintainer of the
lmfit library), it seems that many people using `scipy.optimize` are
actually trying to solve curve-fitting problems.

Focusing on the group of methods wrapped by `minimize()` is a fine starting
point, but you may also want to consider explaining why a user would choose
`minimize()` over other approaches.  That is, describe the sorts of problems
for which `minimize()` is the most suitable tool, those for which it can be
useful but is not the only approach, and those for which it is not well
suited.

The methods of `minimize()` are all scalar minimizers: the user's objective
function returns a single value ("cost") to minimize.  If their objective
function *really* calculates an array and then the user reduces that array
to a scalar (say a sum of squares or a log-likelihood), then they are almost
certainly better off using `least_squares` or `leastsq` or a similar
solver.   It also seems to me that a fair number of people use `minimize()`
when their problem actually is or can be made linear.  That is, it might be
good to clarify when iterative methods are needed, and when regression
methods can be used.
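
To make that concrete, here is a minimal sketch (the model, data, and
starting values are invented for illustration) of the same exponential-decay
fit done two ways: `minimize()` on a scalar sum of squares, and
`least_squares` on the residual array that the scalar was built from:

    import numpy as np
    from scipy.optimize import minimize, least_squares

    # invented data: noisy exponential decay
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 101)
    y = 2.5 * np.exp(-0.7 * x) + rng.normal(scale=0.05, size=x.size)

    def residual(params):
        amp, decay = params
        return amp * np.exp(-decay * x) - y      # array of residuals

    def cost(params):
        return np.sum(residual(params) ** 2)     # scalar sum of squares

    p0 = [1.0, 1.0]
    scalar_fit = minimize(cost, p0, method='BFGS')   # works, but discards structure
    lsq_fit = least_squares(residual, p0)            # exploits the residual array

    print(scalar_fit.x, lsq_fit.x)

Both give essentially the same answer here, but `least_squares` gets to use
the structure of the residual array, which is also what makes estimating
uncertainties afterwards straightforward.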

The methods in `minimize()` are inherently local solvers.  Global
optimization is a different category, and a tutorial would probably do well
to describe the difference while also being clear-eyed that many problems do
not need a global solver so much as a local solver plus some prior knowledge
and understanding of the problem being solved.  That is, many real problems
really are solved with local solvers.  Many (maybe most) problems actually
do start (and certainly should try to start) with a decent understanding of
the scope of the problem.  With a local solver, the task for the user is not
a blind search, but refining variable values and trying to understand their
correlations and uncertainties.
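
As a tiny illustration of that point (the function and starting values here
are made up): the same one-dimensional function, two starting guesses, two
different local minima -- which is exactly where prior knowledge of the
problem comes in:

    import numpy as np
    from scipy.optimize import minimize

    def f(x):
        # a simple 1-d function with several local minima
        return np.sin(3 * x[0]) + 0.1 * x[0] ** 2

    print(minimize(f, x0=[-2.0]).x)   # converges to a local minimum near x ~ -2.6
    print(minimize(f, x0=[1.5]).x)    # converges to a different one near x ~ 1.6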

That is all to say that you probably should not expect too much experience
with these topics on the part of the reader of the tutorial.


> A lot of methods are available.   Most problems can have BFGS thrown at
> them, but I want to explain something for those other cases.  Other
> situations could have features, like constraints or non-differentiability,
> that lend themselves to a specific method. But the module still has a lot
> of alternatives.  Are they there for academic purposes?  Are they the best
> for some problems? How could someone find that out?
>
>
Those are the right questions, and they sort of answer themselves:  the
different methods represent the advance of history, with the older and
simpler ones reflecting a strong aversion to using memory and an emphasis on
performance (that is, trying to find the solution with the least use of
memory and the fewest evaluations of the objective function).  Even now, and
in this conversation, the emphasis is on performance (IMHO, that emphasis is
to the detriment of "user-friendliness" and of helping the user identify
"correctness").  The fact that you asked when the user would choose between
different solvers and different methods for calculating derivatives means
that this burden, which is really a mathematical detail (if an important
one) of the method used, is put on the user of `minimize()`.


> For derivatives, users can choose to provide a function or three different
> types of finite-difference schemes.
>
> When is providing a function better than finite-difference derivatives?
> For Hessians, approximations are sometimes more efficient.  How can we know
> in advance if that's true? Is that ever true for gradients?
>
> How do we choose which finite-difference scheme? `3-point` and `cs` (if it
> is the symmetric approximation I think) have higher-order accuracy, but
> `cs` uses a point not yet computed.  Is `3-point` ever not the way to go?
>

Some of the other responses said that the user should provide functions for
derivatives.  I respectfully disagree that this should be emphasized in a
tutorial.  If (it can be a big "if") the user can write a function for the
derivatives (of cost with respect to each variable), that usually means
that the objective function is differentiable, and probably "sane".  This
means in turn that finite-difference methods really ought to work (in the
sense of "not fail").  Finite-difference Jacobians will certainly mean more
evaluations of the objective function.  Assuming that the Jacobian function
has a runtime about the same as the objective function's, the total runtime
will almost certainly be lower when a Jacobian function is provided.
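
A quick sketch of that trade-off, using SciPy's built-in Rosenbrock function
and its analytic gradient (`rosen` and `rosen_der`): both runs converge to
the same answer, the finite-difference run just calls the objective function
many more times:

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

    fd = minimize(rosen, x0, method='BFGS')                 # finite-difference gradient
    an = minimize(rosen, x0, method='BFGS', jac=rosen_der)  # analytic gradient

    # the finite-difference run needs many more objective evaluations
    print(fd.nfev, an.nfev)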

Does reduced runtime mean that analytic Jacobians should always be
preferred?  I think it does not.

The user of `scipy.optimize.minimize()` should know that run-time is far
less precious than write- and read-time.  At the very least, a Jacobian
function should be used only when the time spent *writing* the Jacobian
function is less than the time saved running the program. That "less than"
might even be "much less than".  Changing runtime by a factor of 5 from 1
second to 200 milliseconds is certainly *not* worth writing a Jacobian
(even if one types much faster and more accurately than average).
 Changing runtime from 100 hours to 20 hours might be worth spending some
time on writing (and testing and maintaining!) the extra code of the
Jacobian function. If you expect the program you are writing will be used
10,000 times, then sure, investigate a Jacobian function.  Or figure out
how to make the objective function faster. Or buy more compute power.

I would hope a tutorial would emphasize other aspects of minimization over
the technical detail and performance boost that might be found by providing
an analytic Jacobian.  These other aspects might include:
   how to compare solver methods and types of cost function (least-squares,
log-probability, etc).
   how to investigate whether the solution is only "local" or is stable and
global.
   how starting values can affect the solution and run time.
   how one can evaluate the quality of the solution ("goodness of fit") and
estimate the uncertainties in the values found for the variables (a small
sketch of this last point follows).
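
For that last point, here is a rough sketch of one way to do it with
`minimize()` alone, reusing the invented exponential-decay data from above.
For a sum-of-squares cost, the inverse Hessian that BFGS builds up can be
scaled into an approximate covariance matrix for the variables; the scale
factor assumes the usual reduced chi-square estimate of the data
uncertainty, and the BFGS `hess_inv` is itself only an approximation (lmfit
and `curve_fit` do this bookkeeping more carefully):

    import numpy as np
    from scipy.optimize import minimize

    # same invented data as in the earlier sketch
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 101)
    y = 2.5 * np.exp(-0.7 * x) + rng.normal(scale=0.05, size=x.size)

    def cost(params):
        amp, decay = params
        return np.sum((amp * np.exp(-decay * x) - y) ** 2)

    result = minimize(cost, [1.0, 1.0], method='BFGS')

    ndata, nvarys = x.size, 2
    s2 = result.fun / (ndata - nvarys)      # estimated residual variance
    cov = 2.0 * s2 * result.hess_inv        # Hessian of a least-squares cost ~ 2 J^T J
    print(result.x, np.sqrt(np.diag(cov)))  # best-fit values and 1-sigma uncertainties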

There is the open-ended but critical question of "is the objective function
correct", or for curve-fitting "is this mathematical model correct"?
Related to this is the question of whether the set of variables used is
actually complete, or whether there are latent variables that the objective
function assumes but that should actually be tested.  These are challenging
topics, but also strike me as being of more importance and interest to the
user of `scipy.optimize` than ultimate performance.   That is, analytic
Jacobians strike me as "advanced usage" and would be a fine next step after
a tutorial is digested.

Sorry that was so long, and feel free to use or ignore any of it as you see
fit.
Cheers,

--Matt Newville