[Numpy-discussion] Covariance matrix from polyfit

josef.pktd at gmail.com
Thu Jul 4 21:40:17 EDT 2013


On Thu, Jul 4, 2013 at 8:01 PM, David Schaich <daschaich at gmail.com> wrote:
> Hi all,
>
> I recently adopted Python and am in the process of replacing my old
> analysis tools. For simple (e.g., linear) interpolations and
> extrapolations, in the past I used gnuplot. Today I set up the
> equivalent with polyfit in numpy v1.7.1, first running a simple test to
> reproduce the gnuplot result.
>
> A discussion on this list back in February alerted me that I should use
> 1/sigma for the weights in polyfit as opposed to 1/sigma**2. Fine --
> that's not what I'm used to, but I can make a note.
> http://mail.scipy.org/pipermail/numpy-discussion/2013-February/065649.html
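
A minimal sketch of that convention, redoing the weighted fit by hand
with np.vander and np.linalg.lstsq (an illustration, not polyfit's
internal code; it reuses the data from the example below):

+++
import numpy as np

# polyfit multiplies each residual by w before squaring, so passing
# w = 1/sigma (not 1/sigma**2) gives the usual chi-square weighting.
x = np.array([0.008, 0.01, 0.015])
y = np.array([1.0822582, 1.0805417, 1.0766624])
sigma = np.array([0.000370, 0.000355, 0.000249])

coeffs = np.polyfit(x, y, 1, w=1.0 / sigma)

# The same weighted least squares done by hand: scale the rows of the
# Vandermonde matrix and the data by 1/sigma, then solve.
A = np.vander(x, 2) / sigma[:, None]
b = y / sigma
manual = np.linalg.lstsq(A, b)[0]

print(coeffs)  # highest-degree coefficient first
print(manual)  # agrees with polyfit, confirming the 1/sigma convention
+++
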
>
> Another issue mentioned in that thread is scaling the covariance matrix by
> fac = resids / (len(x) - order - 2.0)
> This wreaked havoc on the simple test I mentioned above (and included
> below), which fits three data points to a straight line. I spent hours
> trying to figure out why numpy returned negative variances before
> tracking down this line. And indeed, if I add a fake fourth data point,
> I end up with inf's and nan's.
>
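For concreteness, here is the arithmetic behind those failures (a
sketch that assumes polyfit's internal order is the number of
coefficients, deg + 1, as in the 2013 lib/polynomial.py):

+++
# For a straight-line fit, deg = 1, so order = deg + 1 = 2 and the
# covariance scale factor's denominator is len(x) - order - 2.0:
deg = 1
order = deg + 1
for n_points in (3, 4, 5):
    print(n_points, "points -> denominator", n_points - order - 2.0)
# 3 points -> -1.0  (sign flip: negative variances)
# 4 points ->  0.0  (division by zero: inf's and nan's)
# 5 points ->  1.0  (smallest sample that gives a finite result)
+++
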
> There is some lengthy justification for subtracting that 2 around line
> 590 in lib/polynomial.py. Fine -- it's nothing I recall seeing before
> (and I removed it from my local installation), but I'm just a new user.
>
> However, I do think it is important to fix polyfit so that it doesn't
> produce pathological results like those I encountered today. Here are a
> couple of possibilities that would let the subtraction of 2 remain:
> * Check whether len(x) > order + 2, and if it is not, either
> ** Die with an error
> ** Scale by resids / (len(x) - order) instead of
>    resids / (len(x) - order - 2.0)

I would throw out the -2, or at least make it optional like `ddof`.
(It's not in the docstring AFAICS)

Returning a negative (!) definite covariance matrix is definitely a
bug; it should return NaN or raise an exception instead.
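
A hypothetical sketch of that suggestion (scaled_cov and its ddof
keyword are invented for illustration; this is not numpy's API):

+++
def scaled_cov(cov_unscaled, resids, n_points, n_params, ddof=0):
    """Scale an unscaled covariance matrix by the residual variance.

    Hypothetical helper: ddof=0 gives the conventional
    chi^2 / (N - n_params) scaling; ddof=2 reproduces polyfit's
    current -2 behavior, but only when the caller actually has
    enough points for it.
    """
    dof = n_points - n_params - ddof
    if dof <= 0:
        raise ValueError("too few points for a meaningful covariance")
    return cov_unscaled * resids / dof
+++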

my 1.5 cents

Josef

>
> * Don't bother with this scaling at all, leaving it to the users (who
> can subtract 2 if they want). This is what scipy.optimize.leastsq does,
> after what seems to be a good deal of discussion: "This matrix must be
> multiplied by the residual variance to get the covariance of the
> parameter estimates – see curve_fit."
> http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.leastsq.html
> https://github.com/scipy/scipy/pull/448
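
For what it's worth, scipy.optimize.curve_fit applies that
residual-variance scaling itself; a sketch on the same three points,
relying on curve_fit's default treatment of sigma as relative weights:

+++
import numpy as np
from scipy.optimize import curve_fit

m = np.array([0.008, 0.01, 0.015])
dat = np.array([1.0822582, 1.0805417, 1.0766624])
err = np.array([0.000370, 0.000355, 0.000249])

# curve_fit scales the covariance by the reduced chi-square internally,
# so no manual multiplication is needed afterwards.
popt, pcov = curve_fit(lambda x, a, b: a * x + b, m, dat, sigma=err)

print(popt)                    # ~ [-0.7927, 1.0885]
print(np.sqrt(np.diag(pcov)))  # ~ [0.01533, 0.0001914], as gnuplot reports
+++
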
>
> I leave it to those of you with more numpy experience to decide what
> would be the best way to go.
>
> Cheers,
> David
> http://www-hep.colorado.edu/~schaich/
>
>
> +++
> Here's the simple example:
>  >>> import numpy as np
>  >>> m = np.array([0.008, 0.01, 0.015])
>  >>> dat = np.array([1.0822582, 1.0805417, 1.0766624])
>  >>> weight = np.array([1/0.000370, 1/0.000355, 1/0.000249])
>  >>> out, cov = np.polyfit(m, dat, 1, full=False, w=weight, cov=True)
>  >>> print out, '\n', cov
> [-0.79269957 1.08854252]
> [[ -2.34965006e-04   2.84428412e-06]
>  [  2.84428412e-06  -3.66283662e-08]]
>  >>> print np.sqrt(-1. * cov[0][0])
> 0.0153285682895
>  >>> print np.sqrt(-1. * cov[1][1])
> 0.000191385386578
> +++
>
> Gnuplot gives
> +++
> Final set of parameters            Asymptotic Standard Error
> =======================            ==========================
> A               = -0.792719        +/- 0.01533      (1.934%)
> B               = 1.08854          +/- 0.0001914    (0.01758%)
> +++
> so up to the negative sign, all is good.
>
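To see where the correct, positive numbers come from, here is the
weighted least squares covariance done by hand, with the conventional
chi^2 / (N - n_params) scaling (a sketch reusing m, dat and weight
from the session above; not polyfit's internal code):

+++
# Continues the polyfit session above (m, dat, weight already defined):
A = np.vander(m, 2) * weight[:, None]   # weighted design matrix
b = dat * weight                        # weighted data
coeffs, resids = np.linalg.lstsq(A, b)[:2]

# resids[0] is the weighted chi-square; here N - n_params = 3 - 2 = 1.
cov = np.linalg.inv(A.T.dot(A)) * resids[0] / (len(m) - 2)
print(np.sqrt(np.diag(cov)))  # ~ [0.01533, 0.0001914], matching gnuplot
+++
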
> For its part, scipy.optimize.leastsq needs me to do the scaling:
> +++
>  >>> import numpy as np
>  >>> from scipy import optimize
>  >>> m = np.array([0.008, 0.01, 0.015])
>  >>> dat = np.array([1.0822582, 1.0805417, 1.0766624])
>  >>> err = np.array([0.000370, 0.000355, 0.000249])
>  >>> linear = lambda p, x: p[0] * x + p[1]
>  >>> errfunc = lambda p, x, y, err: (linear(p, x) - y) / err
>  >>> p_in = [-1., 1.]
>  >>> all_out = optimize.leastsq(errfunc, p_in[:], args=(m, dat, err),
>  ...                            full_output=1)
>  >>> out = all_out[0]
>  >>> cov = all_out[1]
>  >>> print out, '\n', cov
> [-0.79269959 1.08854252]
> [[  3.40800756e-03  -4.12544212e-05]
>  [ -4.12544212e-05   5.31270007e-07]]
>  >>> chiSq_dof = ((errfunc(out, m, dat, err))**2).sum() / (len(m) - len(out))
>  >>> cov *= chiSq_dof
>  >>> print cov
> [[  2.34964190e-04  -2.84427528e-06]
>  [ -2.84427528e-06   3.66282716e-08]]
> +++


