[SciPy-User] adding linear fitting routine

David J Pine djpine at gmail.com
Wed Dec 4 07:43:52 EST 2013


Daniele,

Thank you for your feedback. Regarding the points you raise:

1. Generalization to arbitrary degree polynomials. This already exists in
numpy.polyfit. One limitation of polyfit is that it does not currently allow
the user to provide absolute uncertainties in the data, but there has been
some discussion of adding this capability.
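
For concreteness, here is a minimal sketch of the limitation, assuming
polyfit's w and cov keyword arguments:

    import numpy as np

    # polyfit accepts weights w, but with cov=True the returned covariance
    # is rescaled by the residuals of the fit, so absolute error bars
    # sigma_y are not propagated through to the parameter uncertainties;
    # that is the missing capability discussed above.
    x = np.linspace(0.0, 10.0, 20)
    sigma_y = 0.5 * np.ones_like(x)   # absolute uncertainties, assumed known
    y = 2.0 * x + 1.0 + sigma_y * np.random.randn(x.size)

    p, cov = np.polyfit(x, y, 1, w=1.0 / sigma_y, cov=True)
    # cov reflects the scatter of the data about the fit, not sigma_y
    # treated as absolute errors.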

2. Generalization to arbitrary orthogonal bases. numpy already provides
fitting routines for various polynomial bases, including chebfit, legfit,
lagfit, hermfit, and hermefit. I am not aware of a fitting routine in
numpy/scipy that works on arbitrary bases.
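
To illustrate what such a routine might look like, here is a minimal
sketch built on numpy.linalg.lstsq; the fit_basis helper and the
particular basis functions are hypothetical, not an existing API:

    import numpy as np

    def fit_basis(x, y, basis_funcs):
        # Build the design matrix column by column from the user-supplied
        # basis functions, then solve the linear least squares problem.
        A = np.column_stack([f(x) for f in basis_funcs])
        return np.linalg.lstsq(A, y)[0]

    x = np.linspace(0.0, 2.0 * np.pi, 50)
    y = np.sin(x) + 0.1 * np.random.randn(x.size)
    coeffs = fit_basis(x, y, [np.ones_like, np.sin, np.cos])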

3. Speed. As far as I know there is no bug in any of the tested software.
The unit test test_linfit.py
(https://github.com/djpine/linfit/blob/master/test_linfit.py) times the
various fitting routines. You can run it yourself (and check the code;
maybe you can spot some errors), but on my laptop I get the results
printed at the end of this message.

linfit does not call a general linear-algebra solver. Instead it calculates
the best-fit slope and y-intercept directly from the closed-form expressions
for a straight-line fit. By contrast, polyfit calls a general least-squares
solver (numpy.linalg.lstsq), which has a certain amount of overhead that
linfit avoids. This may be why polyfit is slower than linfit.
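
The direct calculation is the standard closed-form result for a weighted
straight-line fit. The sketch below shows the idea; it is not linfit's
actual source code:

    import numpy as np

    def line_fit(x, y, sigma_y):
        # Closed-form weighted least squares fit of y = intercept + slope*x.
        w = 1.0 / sigma_y**2
        S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
        Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
        delta = S * Sxx - Sx**2
        slope = (S * Sxy - Sx * Sy) / delta
        intercept = (Sxx * Sy - Sx * Sxy) / delta
        # Covariance of (slope, intercept), valid when sigma_y are
        # absolute uncertainties (the relsigma=False case in point 4).
        cov = np.array([[S, -Sx], [-Sx, Sxx]]) / delta
        return slope, intercept, cov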

4. relsigma.  Other than using no weighting at all, there are basically two
ways that people weight data in a least squares fit.
    (a) provide explicit absolute estimates of the errors (uncertainties)
for each data point.  This is what physical scientists often do.  Setting
relsigma=False tells linfit to use this method of weighting the data.  If
the error estimates are accurate, then the covariance matrix provides
estimates of the uncertainties in the fitting parameters (the slope &
y-intercept).
    (b) provide relative estimates of the errors (uncertainties) for each
data point (it is assumed that the absolute errors are not known, but that
the relative uncertainties between different data points are known).  This
is what social scientists often do.  When only the relative uncertainties
are known, the covariance matrix needs to be rescaled in order to obtain
accurate estimates of the uncertainties in the fitting parameters.  Setting
relsigma=True tells linfit to use this method of weighting the data.
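
In code, the rescaling amounts to multiplying the covariance matrix by
the reduced chi-squared of the fit. A minimal sketch, reusing the
hypothetical line_fit from point 3:

    def rescale_cov(x, y, sigma_y, slope, intercept, cov):
        # Scale the covariance by the reduced chi-squared so that the
        # parameter uncertainties are set by the scatter of the data about
        # the fit, not by the arbitrary overall scale of the relative errors.
        resid = (y - (slope * x + intercept)) / sigma_y
        chi2_red = (resid**2).sum() / (len(x) - 2)   # two fitted parameters
        return cov * chi2_red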

5. Renaming the `sigmay` parameter to `yerr`.  Either choice is fine with
me, but I used `sigmay` to be (mostly) consistent with scipy.optimize.curve_fit.

---------------------------------

Results of timing tests from test_linfit.py

test_linfit.py
....
Compare linfit to scipy.linalg.lstsq with relative individually weighted data points
     10 data points: linfit is faster than scipy.linalg.lstsq by 1.26 times
    100 data points: linfit is faster than scipy.linalg.lstsq by 2.33 times
   1000 data points: linfit is faster than scipy.linalg.lstsq by 12 times
  10000 data points: linfit is faster than scipy.linalg.lstsq by 31.8 times
.
Compare linfit to scipy.linalg.lstsq with unweighted data points
     10 data points: linfit is faster than scipy.linalg.lstsq by 2.4 times
    100 data points: linfit is faster than scipy.linalg.lstsq by 2.5 times
   1000 data points: linfit is faster than scipy.linalg.lstsq by 2.9 times
  10000 data points: linfit is faster than scipy.linalg.lstsq by 3.5 times
 100000 data points: linfit is faster than scipy.linalg.lstsq by 4.4 times
1000000 data points: linfit is faster than scipy.linalg.lstsq by 4.6 times
.
Compare linfit to scipy.stats.linregress with unweighted data points
     10 data points: linfit is faster than scipy.stats.linregress by 5.2 times
    100 data points: linfit is faster than scipy.stats.linregress by 5.1 times
   1000 data points: linfit is faster than scipy.stats.linregress by 4.7 times
  10000 data points: linfit is faster than scipy.stats.linregress by 2.9 times
 100000 data points: linfit is faster than scipy.stats.linregress by 1.8 times
1000000 data points: linfit is faster than scipy.stats.linregress by 1.1 times
.
Compare linfit to polyfit with relative individually weighted data points
     10 data points: linfit is faster than numpy.polyfit by 2.6 times
    100 data points: linfit is faster than numpy.polyfit by 2.5 times
   1000 data points: linfit is faster than numpy.polyfit by 4.4 times
  10000 data points: linfit is faster than numpy.polyfit by 3.1 times
 100000 data points: linfit is faster than numpy.polyfit by 3.5 times
1000000 data points: linfit is faster than numpy.polyfit by 1.9 times
.
Compare linfit to polyfit with unweighted data points
     10 data points: linfit is faster than numpy.polyfit by 3 times
    100 data points: linfit is faster than numpy.polyfit by 3.5 times
   1000 data points: linfit is faster than numpy.polyfit by 4.3 times
  10000 data points: linfit is faster than numpy.polyfit by 6 times
 100000 data points: linfit is faster than numpy.polyfit by 9.5 times
1000000 data points: linfit is faster than numpy.polyfit by 7.1 times
.....
----------------------------------------------------------------------





On Wed, Dec 4, 2013 at 12:13 PM, Daniele Nicolodi <daniele at grinta.net> wrote:

> On 03/12/2013 11:19, David J Pine wrote:
> > I would like to get some feedback and generate some discussion about a
> > least squares fitting routine I submitted last Friday [please see
> > "adding linear fitting routine"
> > <http://comments.gmane.org/gmane.comp.python.scientific.devel/18442>
> > (29 Nov 2013)]. I know that everybody is very busy, but it would be
> > helpful to get some feedback and, I hope, eventually to get this
> > routine added to one of the basic numpy/scipy libraries.
>
>
> I think that adding a least squares fitting routine which correctly
> handles uncertainties and computes the covariance matrix is a good
> idea. I have wanted to do that myself for quite a while.
>
> However, I think that a generalization to arbitrary degree polynomials
> would be much more useful.  A linfit function may be added as a
> convenience wrapper.  Actually it would be nice to have something that
> works on arbitrary orthogonal bases, but it may be difficult to design a
> general interface for such a thing.
>
> Regarding your pull request, I don't really think that your code can be
> much faster than the general purpose least squares fitting already in
> scipy or numpy, modulo some bug somewhere.  You justify that by saying
> that your solution is faster because it does not invert a matrix, but
> this is exactly what you are doing, except that you do not write the
> math in a matrix formalism.
>
> Furthermore, I didn't have a very close look but I don't understand what
> the `relsigma` parameter is supposed to do, and I would rename the
> `sigmay` parameter `yerr`.
>
> Cheers,
> Daniele
>