[SciPy-user] linear regression

josef.pktd at gmail.com
Wed May 27 20:40:39 EDT 2009


On Wed, May 27, 2009 at 8:24 PM,  <josef.pktd at gmail.com> wrote:
> On Wed, May 27, 2009 at 7:35 PM, Robert Kern <robert.kern at gmail.com> wrote:
>> On Wed, May 27, 2009 at 15:29,  <josef.pktd at gmail.com> wrote:
>>> On Wed, May 27, 2009 at 3:37 PM, Robert Kern <robert.kern at gmail.com> wrote:
>>
>>>> For "y=f(x)" models, this is true. Both y and x can be multivariate,
>>>> and you can express the covariance of the uncertainties for each, but
>>>> not covariance between the y and x uncertainties. This is because of
>>>> the numerical tricks used for efficient implementation.
>>>
>>> In this case, OLS would still be unbiased in the linear case, but
>>> maybe not efficient.
>>
>> Are you sure? I see significant deviations using a simple example
>> (albeit one which is utterly rigged in ODR's favor). The X
>> uncertainties start small and grow with increasing X. The Y
>> uncertainties start large and shrink with increasing X. Plotting the
>> estimates shows some strange structure in the OLS estimates.
>>
>>
>> import numpy as np
>>
>> from scipy.odr import RealData, ODR
>> from scipy.odr.models import unilinear
>>
>>
>> beta_true = np.array([-0.47960828215176365,  5.47674024758398481])
>> p_x = np.array([0.,.9,1.8,2.6,3.3,4.4,5.2,6.1,6.5,7.4])
>> p_y = beta_true[0] * p_x + beta_true[1]
>> p_sx = np.array([.03,.03,.04,.035,.07,.11,.13,.22,.74,1.])
>> p_sy = np.array([1.,.74,.5,.35,.22,.22,.12,.12,.1,.04])
>>
>>
>> def random_betas(n=500, prng=np.random):
>>    """ Compute random parameter vectors from both ODR and OLS by generating
>>    random data and fitting it.
>>    """
>>    odr_betas = []
>>    ols_betas = []
>>    for i in range(n):
>>        x = prng.normal(p_x, p_sx)
>>        y = prng.normal(p_y, p_sy)
>>
>>        # ODR:
>>        data = RealData(x, y, sx=p_sx, sy=p_sy)
>>        odr = ODR(data, unilinear, beta0=[1., 1.])
>>        odr_out = odr.run()
>>        odr_betas.append(odr_out.beta)
>>
>>        # Weighted OLS:
>>        A = np.ones((len(x), 2))
>>        A[:,0] = x
>>        # Weight by the Y error.
>>        A /= p_sy[:,np.newaxis]
>>        b, res, rank, s = np.linalg.lstsq(A, y/p_sy)
>>        ols_betas.append(b)
>>
>>        # Alternately:
>>        #ols = ODR(data, unilinear, beta0=[1., 1.])
>>        #ols.set_job(fit_type=2)
>>        #ols_out = ols.run()
>>        #ols_betas.append(ols_out.beta)
>>    return np.array(odr_betas).T, np.array(ols_betas).T
>>
>
> after removing the weighting in your example to get plain OLS, I get
>
>>>> bodr, bols = random_betas(5000)
>>>> bols.mean(1)
> array([-0.4757033 ,  5.46418868])
>>>> bodr.mean(1)
> array([-0.48364392,  5.49508047])
>>>> bodr.mean(1)-beta_true
> array([-0.00403564,  0.01834022])
>>>> bols.mean(1)-beta_true
> array([ 0.00390498, -0.01255157])
>
> I don't see yet why the results with weighted ols are much worse. I
> also confirmed with my in-house econometrician whether it's really
> unbiased and not just asymptotically unbiased.
> As long as the measurement errors in the x regressors are uncorrelated
> with the regression error, ols is unbiased:
> y = X*b + u  with  E(X*u) = 0  is the part that is used in the proof
> of unbiasedness.
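
Spelled out, that step is just the usual algebra (a sketch, with X
including the constant column): bhat = (X'X)^(-1) X'y = b + (X'X)^(-1) X'u,
and the second term drops out in expectation when the regressors carry
no information about u, which is the E(X*u) = 0 condition above.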

The error term in the regression equation is a linear combination of
the true error (p_sy) and of the measurement error in x (p_sx), so its
variance combines both.
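
Written out for the single-regressor case (a sketch; x* denotes the
true regressor, e_x and e_y the two error terms): if y = b0*x* + b1 + e_y
and x = x* + e_x, then y = b0*x + b1 + (e_y - b0*e_x), so the composite
error has standard deviation sqrt(p_sy**2 + b0**2 * p_sx**2) for each
observation.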

So the correct weighting would be according to p_sy**2 + beta**2 *
p_sx**2, which is not possible in practice since we don't know beta;
maybe an iterative approach would work (at least something like this
should be correct).
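
An untested sketch of the kind of iteration I mean, reusing p_x, p_y,
p_sx, p_sy and the weighted-lstsq setup from your script (the helper
name and the fixed iteration count are just placeholders):

import numpy as np

def iterated_wls(x, y, sx, sy, n_iter=5):
    """Weighted OLS, re-deriving the weights from the current slope."""
    beta = np.array([1., 1.])   # arbitrary starting guess, like beta0 above
    for _ in range(n_iter):
        # composite error std for each point: sqrt(sy**2 + slope**2 * sx**2)
        w = np.sqrt(sy**2 + beta[0]**2 * sx**2)
        A = np.ones((len(x), 2))
        A[:, 0] = x
        beta, res, rank, s = np.linalg.lstsq(A / w[:, np.newaxis], y / w)
    return beta

# for one simulated sample:
# x = np.random.normal(p_x, p_sx)
# y = np.random.normal(p_y, p_sy)
# print(iterated_wls(x, y, p_sx, p_sy), beta_true)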

I haven't tried it yet with your example.

Josef


