[SciPy-User] How to estimate error in polynomial coefficients from scipy.polyfit?

josef.pktd at gmail.com josef.pktd at gmail.com
Fri Mar 26 00:10:37 EDT 2010


On Thu, Mar 25, 2010 at 11:40 PM, David Goldsmith
<d.l.goldsmith at gmail.com> wrote:
> On Thu, Mar 25, 2010 at 3:40 PM, Jeremy Conlin <jlconlin at gmail.com> wrote:
>>
>> Yikes!  This sounds like it may be more trouble than it's worth.  I
>> have a few sets of statistical data that each need to have curves fit
>> to them.
>
> That's an awfully generic need - it may be obvious from examination of the
> data that a line is inappropriate, but besides polynomials there are many
> other non-linear models (which can be linearly fit to data by means of data
> transformation) which possess fewer parameters (and thus are simpler from a
> parameter analysis perspective).  So, the question is: why are you fitting
> to polynomials?  If it's just to get a good fit to the data, you might be
> getting "more fit" than your data warrants (and even if that isn't a
> problem, you probably want to use a polynomial form different from "standard
> form," e.g., Lagrange interpolators).  Are you sure something like an
> exponential growth/decay or power law model (both of which are "more
> natural," linearizable, two-parameter models) wouldn't be more appropriate -
> it would almost certainly be simpler to analyze (and perhaps easier to
> justify to a referee).
>
> On this note, perhaps some of our experts might care to comment: what
> "physics" (in a generalized sense) gives rise to a polynomial dependency of
> degree higher than two?  The only generic thing I can think of is something
> where third or higher order derivatives proportional to the independent
> variable are important, and those are pretty uncommon.
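
On the exponential/power-law point above: those models can be fit with
the same polyfit machinery after a log transform.  A minimal sketch with
made-up numbers (multiplicative noise, so the log stays well defined):

import numpy as np

# hypothetical exponential-decay data, y = a * exp(b*x)
x = np.linspace(0, 4, 50)
a_true, b_true = 5.0, -0.7
y = a_true * np.exp(b_true*x) * np.exp(0.05*np.random.randn(50))

# log(y) = log(a) + b*x, an ordinary degree-1 polynomial fit
b_hat, loga_hat = np.polyfit(x, np.log(y), 1)
print np.exp(loga_hat), b_hat   # estimates of a and b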

Taylor series expansion is a "natural" source of higher order
polynomials, although, as Anne pointed out, it is not a "nice" basis
for the space of functions.  In most of the examples I have seen, 3rd
order was the highest; beyond that, people switch to non-parametric
methods or to Fourier, Chebyshev or Hermite expansions.
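
For the "nicer" bases, recent numpy versions have fitting routines in
the numpy.polynomial package; a quick sketch with made-up data, fitting
a cubic in the Chebyshev basis instead of the power basis:

import numpy as np
from numpy.polynomial import chebyshev as cheb

x = np.linspace(-1, 1, 50)              # Chebyshev basis lives on [-1, 1]
y = np.sin(3*x) + 0.1*np.random.randn(50)

coef = cheb.chebfit(x, y, 3)            # coefficients, lowest order first
yhat = cheb.chebval(x, coef)            # fitted values
print coef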

I'm attaching the statsmodels equivalent of most of the example in the
Stata FAQ.  A few parts are (still) a bit more complicated than the
Stata version.
Final result: the F-test rejects the null of equal coefficients (small
p-value) when the coefficients are far enough apart, and does not
reject when the two regression equations are samples from the same
data-generating process.
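
On the original question about the uncertainty of the polyfit
coefficients: with the OLS route the results object carries the
standard errors directly (res1.bse, if I remember the attribute name
correctly), and they can also be computed by hand from the Vandermonde
matrix.  A rough sketch with made-up degree-1 data:

import numpy as np

x = np.random.randn(100)
y = 10 + 15*x + 2*np.random.randn(100)

deg = 1
X = np.vander(x, deg + 1)                     # Vandermonde (design) matrix
coef, ssr, rank, sv = np.linalg.lstsq(X, y)
sigma2 = ssr[0] / (len(y) - (deg + 1))        # residual variance
cov = sigma2 * np.linalg.inv(np.dot(X.T, X))  # covariance of the coefficients
bse = np.sqrt(np.diag(cov))                   # standard errors, same order as coef
print coef
print bse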

Josef


>
> DG
>
>>
>> I can see what the parameters are, and they are similar across the
>> sets.  I want to know if the coefficients are within a few standard
>> deviations of each other.  That's what I was hoping to get at, but it
>> seems to be more trouble than I'm willing to accept, unless anyone has
>> a simpler solution.
>>
>> Thanks,
>> Jeremy
>> >
>> > Anne
>> >
>> >> residuals, rank, singular_values, rcond : present only if `full` = True
>> >>     Residuals of the least-squares fit, the effective rank of the
>> >>     scaled Vandermonde coefficient matrix, its singular values, and
>> >>     the specified value of `rcond`. For more details, see
>> >>     `linalg.lstsq`.
>> >>
>> >> I don't think any of these things is the "design matrix" you have
>> >> indicated I need.  Unfortunately, the documentation for linalg.lstsq
>> >> does not say what rcond is.  Any ideas?
>> >>
>> >> Jeremy
-------------- next part --------------
# -*- coding: utf-8 -*-
"""F test for null hypothesis that coefficients in two regressions are the same

Created on Thu Mar 25 22:56:45 2010
Author: josef-pktd
"""

import numpy as np
from numpy.testing import assert_almost_equal
import scikits.statsmodels as sm

nobs = 10 #100
x1 = np.random.randn(nobs)
y1 = 10 + 15*x1 + 2*np.random.randn(nobs)

x1 = sm.add_constant(x1) #, prepend=True)
assert_almost_equal(x1, np.vander(x1[:,0],2), 16)
res1 = sm.OLS(y1, x1).fit()
print res1.params
print np.polyfit(x1[:,0], y1, 1)
assert_almost_equal(res1.params, np.polyfit(x1[:,0], y1, 1), 14)
print res1.summary(xname=['x1','const1'])

#regression 2
x2 = np.random.randn(nobs)
y2 = 19 + 17*x2 + 2*np.random.randn(nobs)
#y2 = 10 + 15*x2 + 2*np.random.randn(nobs)  # if H0 is true

x2 = sm.add_constant(x2) #, prepend=True)
assert_almost_equal(x2, np.vander(x2[:,0],2), 16)

res2 = sm.OLS(y2, x2).fit()
print res2.params
print np.polyfit(x2[:,0], y2, 1)
assert_almost_equal(res2.params, np.polyfit(x2[:,0], y2, 1), 14)
print res2.summary(xname=['x2','const2'])


# joint regression

x = np.concatenate((x1,x2),0)
y = np.concatenate((y1,y2))
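# dummy is 0 for the first sample and 1 for the second; the interaction
# columns x*dummy let both slope and intercept differ across samples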
dummy = np.arange(2*nobs)>nobs-1
x = np.column_stack((x,x*dummy[:,None]))

res = sm.OLS(y, x).fit()
print res.summary(xname=['x','const','x2','const2'])

print '\nF test for equal coefficients in 2 regression equations'
#effect of dummy times second regression is zero
#is equivalent to 3rd and 4th coefficient are both zero
print res.f_test([[0,0,1,0],[0,0,0,1]])

print '\nchecking coefficients individual versus joint'
print res1.params, res2.params
print res.params[:2], res.params[:2]+res.params[2:]
assert_almost_equal(res1.params, res.params[:2], 13)
assert_almost_equal(res2.params, res.params[:2]+res.params[2:], 13)
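
# note: with nobs = 10 per sample the estimates are noisy; the
# commented-out nobs = 100 makes the results more stable across runs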

