[SciPy-user] linear regression
josef.pktd at gmail.com
josef.pktd at gmail.com
Wed May 27 20:40:39 EDT 2009
On Wed, May 27, 2009 at 8:24 PM, <josef.pktd at gmail.com> wrote:
> On Wed, May 27, 2009 at 7:35 PM, Robert Kern <robert.kern at gmail.com> wrote:
>> On Wed, May 27, 2009 at 15:29, <josef.pktd at gmail.com> wrote:
>>> On Wed, May 27, 2009 at 3:37 PM, Robert Kern <robert.kern at gmail.com> wrote:
>>
>>>> For "y=f(x)" models, this is true. Both y and x can be multivariate,
>>>> and you can express the covariance of the uncertainties for each, but
>>>> not covariance between the y and x uncertainties. This is because of
>>>> the numerical tricks used for efficient implementation
>>>
>>> In this case, OLS would still be unbiased in the linear case, but
>>> maybe not efficient.
>>
>> Are you sure? I see significant deviations using a simple example
>> (albeit one which is utterly rigged in ODR's favor). The X
>> uncertainties start small and grow with increasing X. The Y
>> uncertainties start large and shrink with increasing X. Plotting the
>> estimates shows some strange structure in the OLS estimates.
>>
>>
>> import numpy as np
>>
>> from scipy.odr import RealData, ODR
>> from scipy.odr.models import unilinear
>>
>>
>> beta_true = np.array([-0.47960828215176365, 5.47674024758398481])
>> p_x = np.array([0.,.9,1.8,2.6,3.3,4.4,5.2,6.1,6.5,7.4])
>> p_y = beta_true[0] * p_x + beta_true[1]
>> p_sx = np.array([.03,.03,.04,.035,.07,.11,.13,.22,.74,1.])
>> p_sy = np.array([1.,.74,.5,.35,.22,.22,.12,.12,.1,.04])
>>
>>
>> def random_betas(n=500, prng=np.random):
>> """ Compute random parameter vectors from both ODR and OLS by generating
>> random data and fitting it.
>> """
>> odr_betas = []
>> ols_betas = []
>> for i in range(n):
>> x = np.random.normal(p_x, p_sx)
>> y = np.random.normal(p_y, p_sy)
>>
>> # ODR:
>> data = RealData(x, y, sx=p_sx, sy=p_sy)
>> odr = ODR(data, unilinear, beta0=[1., 1.])
>> odr_out = odr.run()
>> odr_betas.append(odr_out.beta)
>>
>> # Weighted OLS:
>> A = np.ones((len(x), 2))
>> A[:,0] = x
>> # Weight by the Y error.
>> A /= p_sy[:,np.newaxis]
>> b, res, rank, s = np.linalg.lstsq(A, y/p_sy)
>> ols_betas.append(b)
>>
>> # Alternately:
>> #ols = ODR(data, unilinear, beta0=[1., 1.])
>> #ols.set_job(fit_type=2)
>> #ols_out = ols.run()
>> #ols_betas.append(ols_out.beta)
>> return np.array(odr_betas).T, np.array(ols_betas).T
>>
>
> after removing the weighting in your example to get plain OLS, I get
>
>>>> bodr, bols = random_betas(5000)
>>>> bols.mean(1)
> array([-0.4757033 , 5.46418868])
>>>> bodr.mean(1)
> array([-0.48364392, 5.49508047])
>>>> bodr.mean(1)-beta_true
> array([-0.00403564, 0.01834022])
>>>> bols.mean(1)-beta_true
> array([ 0.00390498, -0.01255157])
>
> I don't see yet why the results with weighted ols are much worse. I
> also confirmed with by inhouse econometrician, whether it's really
> unbiased and not just asympotically unbiased.
> As long as the measurement error in the x regressors are uncorrelated
> with the regression error, ols is unbiased
> y = X*b + u E(X*u)=0 that's the part that is used in the proof of
> unbiasedness.
The variance of the error term in the regression equation is a linear
combination of the true error (p_sy) and the measurement error in x
(p_sx)
So the correct weighting would be according to p_sy**2 + beta**2 *
p_sx**2, which is in practice not possible since we don't know beta,
or maybe iterative would work (at least something like this should be
correct)
I haven't tried it yet with your example.
Josef
More information about the SciPy-User
mailing list