[SciPy-User] calculating numerous linear regressions quickly

Mon Jan 13 20:59:56 EST 2014

On Mon, Jan 13, 2014 at 3:06 PM,  <josef.pktd at gmail.com> wrote:
> On Mon, Jan 13, 2014 at 2:49 PM, Bryan Woods <bwoods at aer.com> wrote:
>> Given some geospatial grid with a time dimension V[t, lat, lon], I want to
>> compute the trend at each spatial point in the domain. Essentially I am
>> trying to compute many linear regressions in the form:
>> y = mx+b
>> where y is the predicted value of V, x is the time coordinate array. The
>> coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x,
>> or t) will be the same for each regression. I want to gather the regression
>> coefficients (m,b), correlation, and p-value for the temporal trend at each
>> spatial point. This can be directly accomplished by repeatedly calling
>> stats.linregress inside of a loop for every [lat, lon] point in the domain,
>> but it is not efficient.
>>
>> The challenge is that I need to compute a lot of them quickly and a python
>> loop is proving very slow. I feel like there should be some version of
>> stats.linregress that accepts and returns multidimensional without being
>> forced into using a python loop. Suggestions?
>
> That can be done completely without loops.
>
> reshape the grid to 2d (t, nlat*nlong)  -> Y
> trend = np.vander(t, 2)
> (m,b) = np.linalg.pinv(trend).dot(Y)
>
> and then a few more array operations to get the other statistics.
>
> I can try to do it later if needed.

some quickly written version
this is for general x matrix as long as it is common to all regression
That's pretty much how statsmodels calculates OLS (sm.OLS is not
vectorized but has many other goodies instead).

A version that uses that there is only one slope coefficient might be
a bit faster, but I don't have that memorized.
That would be copying linregress and vectorizing it.

Josef

-------------------------------
# -*- coding: utf-8 -*-
"""multivariate OLS, vectorized independent OLS with common exog
Created on Mon Jan 13 19:33:44 2014
Author: Josef Perktold
"""
import numpy as np
from statsmodels.regression.linear_model import OLS

nobs = 20
k_y = 30
# generate trend
trend = np.linspace(-1, 1, nobs)
x = np.vander(trend, 2)
assert x.shape == (nobs, 2)

# generate random sample
beta = 1 + np.random.randn(2, k_y)
y_true = np.dot(x, beta)
y = y_true + 0.5 * np.random.randn(*y_true.shape)

######## estimate
# x is common design matrix (nobs, k_vars)
# y are independent/response variables (nobs, k_y)
xpinv = np.linalg.pinv(x)
params = xpinv.dot(y)

# get some additional results
xxinv = np.dot(xpinv, xpinv.T)
fitted = np.dot(x, params)
resid = y - fitted
y_mean = y.mean(0)
tss = ((y - y_mean)**2).sum(0)
rss = (resid**2).sum(0)
r_squared = 1 - rss / tss
df_resid = nobs - 2
mse = rss / df_resid
bse = np.sqrt(np.diag(xxinv)[:, None] * mse)  # standard error of params
########

# reference statsmodels OLS
from collections import defaultdict as DefaultDict
results = DefaultDict(list)
for yj in y.T:
    res = OLS(yj, x).fit()
    results['params'].append(res.params)
    results['bse'].append(res.bse)
    results['r_squared'].append(res.rsquared)
    results['mse'].append(res.mse_resid)

print '\nparameters'
print np.column_stack((params.T, results['params']))
print '\nstandard error of parameter estimates'
print np.column_stack((bse.T, results['bse']))
print '\nR-squared'
print np.column_stack((r_squared.T, results['r_squared']))

from numpy.testing import assert_allclose
assert_allclose(params.T, results['params'], rtol=1e-13)
assert_allclose(bse.T, results['bse'], rtol=1e-13)
assert_allclose(r_squared.T, results['r_squared'], rtol=1e-13)

---------------

>
> Josef
>
>>
>> Thanks,
>> Bryan
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>