[SciPy-User] calculating numerous linear regressions quickly
josef.pktd at gmail.com
josef.pktd at gmail.com
Mon Jan 13 20:59:56 EST 2014
On Mon, Jan 13, 2014 at 3:06 PM, <josef.pktd at gmail.com> wrote:
> On Mon, Jan 13, 2014 at 2:49 PM, Bryan Woods <bwoods at aer.com> wrote:
>> Given some geospatial grid with a time dimension V[t, lat, lon], I want to
>> compute the trend at each spatial point in the domain. Essentially I am
>> trying to compute many linear regressions in the form:
>> y = mx+b
>> where y is the predicted value of V, x is the time coordinate array. The
>> coordinates t, lat, lon at all equispaced 1-D arrays, so the predictor (x,
>> or t) will be the same for each regression. I want to gather the regression
>> coefficients (m,b), correlation, and p-value for the temporal trend at each
>> spatial point. This can be directly accomplished by repeatedly calling
>> stats.linregress inside of a loop for every [lat, lon] point in the domain,
>> but it is not efficient.
>>
>> The challenge is that I need to compute a lot of them quickly and a python
>> loop is proving very slow. I feel like there should be some version of
>> stats.linregress that accepts and returns multidimensional without being
>> forced into using a python loop. Suggestions?
>
> That can be done completely without loops.
>
> reshape the grid to 2d (t, nlat*nlong) -> Y
> trend = np.vander(t, 2)
> (m,b) = np.linalg.pinv(trend).dot(Y)
>
> and then a few more array operations to get the other statistics.
>
> I can try to do it later if needed.
some quickly written version
this is for general x matrix as long as it is common to all regression
That's pretty much how statsmodels calculates OLS (sm.OLS is not
vectorized but has many other goodies instead).
A version that uses that there is only one slope coefficient might be
a bit faster, but I don't have that memorized.
That would be copying linregress and vectorizing it.
Josef
-------------------------------
# -*- coding: utf-8 -*-
"""multivariate OLS, vectorized independent OLS with common exog
Created on Mon Jan 13 19:33:44 2014
Author: Josef Perktold
"""
import numpy as np
from statsmodels.regression.linear_model import OLS
nobs = 20
k_y = 30
# generate trend
trend = np.linspace(-1, 1, nobs)
x = np.vander(trend, 2)
assert x.shape == (nobs, 2)
# generate random sample
beta = 1 + np.random.randn(2, k_y)
y_true = np.dot(x, beta)
y = y_true + 0.5 * np.random.randn(*y_true.shape)
######## estimate
# x is common design matrix (nobs, k_vars)
# y are independent/response variables (nobs, k_y)
xpinv = np.linalg.pinv(x)
params = xpinv.dot(y)
# get some additional results
xxinv = np.dot(xpinv, xpinv.T)
fitted = np.dot(x, params)
resid = y - fitted
y_mean = y.mean(0)
tss = ((y - y_mean)**2).sum(0)
rss = (resid**2).sum(0)
r_squared = 1 - rss / tss
df_resid = nobs - 2
mse = rss / df_resid
bse = np.sqrt(np.diag(xxinv)[:, None] * mse) # standard error of params
########
# reference statsmodels OLS
from collections import defaultdict as DefaultDict
results = DefaultDict(list)
for yj in y.T:
res = OLS(yj, x).fit()
results['params'].append(res.params)
results['bse'].append(res.bse)
results['r_squared'].append(res.rsquared)
results['mse'].append(res.mse_resid)
print '\nparameters'
print np.column_stack((params.T, results['params']))
print '\nstandard error of parameter estimates'
print np.column_stack((bse.T, results['bse']))
print '\nR-squared'
print np.column_stack((r_squared.T, results['r_squared']))
from numpy.testing import assert_allclose
assert_allclose(params.T, results['params'], rtol=1e-13)
assert_allclose(bse.T, results['bse'], rtol=1e-13)
assert_allclose(r_squared.T, results['r_squared'], rtol=1e-13)
---------------
>
> Josef
>
>>
>> Thanks,
>> Bryan
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
More information about the SciPy-User
mailing list