[SciPy-user] Ols for np.arrays and masked arrays
josef.pktd at gmail.com
josef.pktd at gmail.com
Mon Jan 19 15:30:57 EST 2009
On Mon, Jan 19, 2009 at 11:53 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> Hi,
> I do not want to sound overly critical as I would like to assist with this.
>
> I am not sure of what your design goal is and what should the Regression
> class actually contain and do. Do you want something like an R lm object?
>
After our previous discussion, I was thinking how a more useful
interface for regression analysis could look like (without using a
full formula framework).
The ols estimation part was just a quick example for a regression. The
main purpose of this is to find a good interface.
My basic idea was to create a general regression class, that does the
initialization and common calculation and the specific estimation is
implemented in the subclasses, e.g. OLS, GLS, Ridge,...(?). For the
basic statistics, which are calculated on demand, I'm planning
something like http://www.scipy.org/Cookbook/OLS. I would like also to
get a similar wrapper for non-linear least squares, optimize.leastsq,
because that doesn't produce the (normalized) parameter standard
errors, or for non-linear maximum likelihood estimator. Also, I don't
want to duplicate work in stats.models.
> But, I do not like that your _init_ function does so much work that
> probably does not belong there. I would have thought that it would just
> initialize certain important variables including the solutions (b). One
> reasons is perhaps you just want to update the input arrays but this
> design forces you to create a new object.
>
> At what stage should you check for valid inputs and the correct
> dimensions of x?
I think this is the main point of the __init__ function, to transform
the data from a convenient specification for the user to the form that
is used by the estimation procedures. If the design matrix is not
already a "clean" array, then it needs to copied at least once to be
handed to linalf (as far as I understand this)
>
> I would also prefer the object having the standard errors of the
> solutions and eventually other 'useful' statistics like sum of squares,
> R-squared. However, I do not know how to the standard errors using
> linalg.lstsq (I vaguely recall there is another way that would work).
> The others are easy to get probably on demand like R's summary function.
My current thinking is that some basic statistics, such as estimated
parameters, standard errors, pvalues and R^2 are returned immediately.
More in depth analysis of the residuals are returned on demand in a
special result structure (class) (?).
Just to see how it works, I copied the cookbook ols recipe into my ols
class, I had to adjust the dimension (1D to 2D columns). It looks ok,
but I didn't test more than making sure it runs and looks reasonable.
Several of the test functions can go into the regression class, since
they apply also for other regression models not just OLS.
The current version takes one or several arrays as inputs. Inputs can
be 1D or 2D, and can be either numpy arrays or masked arrays. All
observations with at least one masked or nan value are removed from
the regression. Other array subclasses are not yet included.
Predicted values, called yhat, are masked arrays if the input was
masked or contained nans. If input are plain numpy arrays with finite
values, then the predicted value array, yhat, is also a np array. (I
haven't checked the copied part yet, eg. the error vector is on the
compressed values)
I attached the updated version. lm in R looks a bit like an umbrella
function, but I have not used it enough to tell what useful features
should be copied by a regression class. Generalized linear models will
be in stats.models.
Josef
-------------- next part --------------
import numpy as np
import numpy.ma as ma
import numpy.random as mtrand #import randn, seed
import matplotlib.pylab as plt
import time
from scipy import linalg, stats
#from numpy.testing import assert_almost_equal
from numpy.ma.testutils import assert_almost_equal
import regressionanalysis_ma2 as _ini
class Regression(object):
def __init__(self,y,*args,**kwds):
'''
Parameters
----------
y: array, 1D or 2D
variable of regression. If 2D, then it needs to be one column
args: sequence of 1D or 2D arrays
one or several arrays of regressors. If 2D than interpretation
is that each column represents one variable and rows are
observations
kwds: addconst (default: True)
if True then a constant is added to the regressor
return
------
class instance with regression results
Notes
-----
Observation (rows) that contain masked values in a masked array or nans
in any of the regressors or in y will be dropped for the calculation.
Arrays that correspond to observation, e.g. estimated y (yhat)
or residuals, are returned as masked arrays if masked values or nans are
in the input data.
Usage
-----
estm = Ols(y, x1, x2)
estm.b # estimated parameter vector
example polynomial
estv = Ols(y, np.vander(x[:,0],3), x[:,1], addconst=False)
estv.b # estimated parameter vector
estv.yhat # fitted values
'''
self.addconst = kwds.pop('addconst', True)
self.y_varnm = kwds.pop('y_varnm','y')
self.x_varnm = kwds.pop('x_varnm',[])
if not isinstance(self.x_varnm,list):
self.x_varnm = list(self.x_varnm)
# Make sure that any NaN in y and args are masked, and force masked arrays
y = ma.fix_invalid(y)
if y.ndim == 1:
y.shape = (-1, 1)
design = list(args)
#design = [ma.fix_invalid(x) for x in args]
# We need a constant: add a column of 1
if self.addconst:
design.insert(0, np.ones(y.shape))
self.x_varnm.insert(0, 'const')
# stack first than fix; saves one copy, maybe not
design = ma.column_stack(design) # convert types and make copy
#does it make copy if args = x single 2D array
design = ma.fix_invalid(design, copy=False)
# Find the masked elements
(ymask, designmask) = (ma.getmaskarray(y), ma.getmaskarray(design))
mask = ma.mask_or(ymask, designmask, shrink=False).any(axis=-1)
valid = ~mask
# And now, drop them
self.y = y.data[valid,:]
self.design = design.data[valid]
self.x = self.design # copy to use cookbook ols
self.mask = mask
self.ymasked = ma.array(y, mask=mask, keep_mask=True)
for ii in range(len(self.x_varnm),self.design.shape[1]+1):
self.x_varnm.append('var%002d'%ii)
#
#self.yhat = None
self.estimate()
def estimate(self):
pass
def get_yhat(self):
pass
def summary(self):
pass
def predict(self, x):
pass
class Ols(Regression):
def __init__(self,y,*args,**kwds):
Regression.__init__(self, y, *args, **kwds)
def estimate(self):
(y, x) = self.y, self.design
#print 'y.shape, x.shape'
#print y.shape, x.shape
(self.b, self.resid, rank, self.sigma) = linalg.lstsq(x, y)
#
## yhat = ma.empty_like(self.ymasked)
## mask = self.mask
## if mask is not ma.nomask:
## yhat[~mask, :] = np.dot(self.design, self.b)
## else:
## yhat[:,:] = np.dot(self.design, self.b)
#self.yhat = yhat
self.yhat = self.predict()
#print rank
# estimating coefficients, and basic stats
self.inv_xx = linalg.pinv(np.dot(self.x.T,self.x)) # use Moore-Penrose pseudoinverse
xy = np.dot(self.x.T,self.y) # estimate coefficients
#print self.b.shape # column vector
self.nobs = self.y.shape[0] # number of observations
self.ncoef = self.x.shape[1] # number of coef.
self.df_e = self.nobs - self.ncoef # degrees of freedom, error
self.df_r = self.ncoef - 1 # degrees of freedom, regression
self.e = self.y - np.dot(self.x,self.b) # residuals
self.sse = np.dot(self.e.T,self.e)/self.df_e # SSE
self.se = np.sqrt(np.diagonal(self.sse*self.inv_xx))[:,np.newaxis] # coef. standard errors
self.t = self.b / self.se # coef. t-statistics
self.p = (1-stats.t.cdf(np.abs(self.t), self.df_e)) * 2 # coef. p-values
self.R2 = 1 - self.e.var()/self.y.var() # model R-squared
self.R2adj = 1-(1-self.R2)*((self.nobs-1)/(self.nobs-self.ncoef)) # adjusted R-square
self.F = (self.R2/self.df_r) / ((1-self.R2)/self.df_e) # model F-statistic
self.Fpv = 1-stats.f.cdf(self.F, self.df_r, self.df_e) # F-statistic p-value
def dw(self):
"""
Calculates the Durbin-Waston statistic
"""
de = np.diff(self.e,1,axis=0)
dw = np.dot(de.T,de) / np.dot(self.e.T,self.e);
return dw
def omni(self):
"""
Omnibus test for normality
"""
return stats.normaltest(self.e)
def JB(self):
"""
Calculate residual skewness, kurtosis, and do the JB test for normality
"""
# Calculate residual skewness and kurtosis
skew = stats.skew(self.e)
kurtosis = 3 + stats.kurtosis(self.e)
# Calculate the Jarque-Bera test for normality
JB = (self.nobs/6) * (np.square(skew) + (1/4)*np.square(kurtosis-3))
JBpv = 1-stats.chi2.cdf(JB,2);
return JB, JBpv, skew, kurtosis
def ll(self):
"""
Calculate model log-likelihood and two information criteria
"""
# Model log-likelihood, AIC, and BIC criterion values
ll = -(self.nobs*1/2)*(1+np.log(2*np.pi)) - \
(self.nobs/2)*np.log(np.dot(self.e.T,self.e)/self.nobs)
aic = -2*ll/self.nobs + (2*self.ncoef/self.nobs)
bic = -2*ll/self.nobs + (self.ncoef*np.log(self.nobs))/self.nobs
return ll, aic, bic
def summary(self):
"""
Printing model output to screen
"""
# local time & date
t = time.localtime()
# extra stats
ll, aic, bic = self.ll()
JB, JBpv, skew, kurtosis = self.JB()
omni, omnipv = self.omni()
# printing output to screen
print '\n=============================================================================='
print "Dependent Variable: " + self.y_varnm
print "Method: Ordinary Least Squares"
print "Date: ", time.strftime("%a, %d %b %Y",t)
print "Time: ", time.strftime("%H:%M:%S",t)
print '# obs: %5.0f' % self.nobs
print '# variables: %5.0f' % self.ncoef
print '=============================================================================='
print 'variable coefficient std. Error t-statistic prob.'
print '=============================================================================='
for i in range(self.ncoef):
#print self.x_varnm[i],self.b[i],self.se[i],self.t[i],self.p[i]
print '''% -5s % 9.4f % 9.4f % 9.4f % 9.4f''' % tuple(
[self.x_varnm[i],self.b.ravel()[i],self.se.ravel()[i],self.t.ravel()[i],self.p.ravel()[i]])
print '=============================================================================='
print 'Models stats Residual stats'
print '=============================================================================='
print 'R-squared % 9.4f Durbin-Watson stat % 9.4f' % tuple([self.R2, self.dw()])
print 'Adjusted R-squared % 9.4f Omnibus stat % 9.4f' % tuple([self.R2adj, omni])
print 'F-statistic % 9.4f Prob(Omnibus stat) % 9.4f' % tuple([self.F, omnipv])
print 'Prob (F-statistic) % 9.4f JB stat % 9.4f' % tuple([self.Fpv, JB])
print 'Log likelihood % 9.4f Prob(JB) % 9.4f' % tuple([ll, JBpv])
print 'AIC criterion % 9.4f Skew % 9.4f' % tuple([aic, skew])
print 'BIC criterion % 9.4f Kurtosis % 9.4f' % tuple([bic, kurtosis])
print '=============================================================================='
def predict(self, x=None):
'''redurn prediction
todo: add prediction error, confidence intervall'''
if x is None:
x = self.design
yest = np.dot(x, self.b)
if (not self.mask is None) and self.mask.any():
# (not self.mask is None) for shorthand, not used currently
# (mask is not ma.nomask) is not necessary b/c no shrink
output = ma.masked_all_like(self.ymasked)
output[~self.mask, :] = yest
# does not remove mask for unmasked
return output
else:
return yest
else:
x = ma.fix_invalid(x)
if self.addconst:
x = ma.column_stack((ma.ones(x.shape[0]), x))
output = ma.dot(x, self.b)
#maybe detend
mask = output.mask
if (mask is not ma.nomask) and (not mask.any()):
#difficult to read 2 negatives
output = output.data
return output
def cookbook_example():
xxsingular = False#True
x = np.linspace(0, 15, 40)
a,b,c = 3.1, 42, -304.2
y_true = a*x**2 + b*x + c
y_meas = y_true + 100.01*np.random.standard_normal( y_true.shape )
if xxsingular:
xx = np.c_[x**2,x,2*x,np.ones(x.shape[0])]
else:
xx = np.c_[x**2,x,np.ones(x.shape[0])]
x_varnm = ['const', 'x1','x2','x3','x4']
k = xx.shape[1]
#m = Ols(y_meas,xx,y_varnm = 'y',x_varnm = x_varnm[:k-1],addconst = False)
m = Ols(y_meas,xx,y_varnm = 'y',x_varnm = x_varnm[:k],addconst = False)
m.summary()
#matplotlib ploting
plt.title('Linear Regression Example')
plt.plot(x,y_true,'g.--')
plt.plot(x,y_meas,'k.')
plt.plot(x,y_meas-m.e.flatten(),'r.-')
plt.legend(['original','plus noise', 'regression'], loc='lower right')
np.testing.assert_almost_equal(y_meas[:,np.newaxis]-m.e,m.yhat)
return m
if __name__ == '__main__':
import numpy as np
#from olsexample import ols
def generate_data(nobs):
x = np.random.randn(nobs,2)
btrue = np.array([[5,1,2]]).T
y = np.dot(x, btrue[1:,:]) + btrue[0,:] + 0.5 * np.random.randn(nobs,1)
return y,x
y,x = generate_data(15)
#benchmark no masked arrays, and one 2D array for x
est = Ols(y[1:,:],x[1:,:]) # initialize and estimate with ols, constant added by default
_est = _ini.Ols(y[1:,:],x[1:,:]) # initialize and estimate with ols, constant added by default
print 'ols estimate'
est.estimate()
print est.b.T
print _est.b.T
#print np.array([[5,1,2]]) # true coefficients
ynew,xnew = generate_data(3)
ypred = est.predict(xnew)
print ' ytrue ypred error'
print np.c_[ynew, ypred, ynew - ypred]
ypred = _est.predict(xnew)
print np.c_[ynew, ypred, ynew - ypred]
assert not isinstance(est.yhat, ma.MaskedArray)
assert not isinstance(ypred, ma.MaskedArray)
#case masked array y
ym = y.copy()
ym[0,:] = np.nan
ym = ma.masked_array(ym, np.isnan(ym))
estm1 = Ols(ym,x)
_estm1 = _ini.Ols(ym,x)
print estm1.b.T
print estm1.yhat.shape
print _estm1.b.T
print _estm1.yhat.shape
print 'yhat'
print estm1.yhat[:10,:]
print _estm1.yhat[:10,:]
assert_almost_equal(estm1.yhat[1:,:], est.yhat)
assert isinstance(estm1.yhat, ma.MaskedArray)
assert not isinstance(estm1.predict(x[:3,]), ma.MaskedArray)
# not sure about this one
#masked y and several x args, addconst=False
estm2 = Ols(ym,np.ones(ym.shape),x[:,0],x[:,1],addconst=False)
_estm2 = _ini.Ols(ym,np.ones(ym.shape),x[:,0],x[:,1],addconst=False)
print estm2.b.T
print _estm2.b.T
assert_almost_equal(estm2.b, estm1.b)
assert_almost_equal(estm2.yhat, estm1.yhat)
assert isinstance(estm2.yhat, ma.MaskedArray)
#masked y and several x args,
estm3 = Ols(ym,x[:,0],x[:,1])
_estm3 = _ini.Ols(ym,x[:,0],x[:,1])
print estm3.b.T
print _estm3.b.T
assert_almost_equal(estm3.b, estm1.b)
assert_almost_equal(estm3.yhat, estm2.yhat)
assert isinstance(estm3.yhat, ma.MaskedArray)
#masked array in y and one x variable
x_0 = x[:,0].copy() # is copy necessary?
x_0[0] = np.nan
x_0 = ma.masked_array(x_0, np.isnan(x_0))
estm4 = Ols(ym,x_0,x[:,1])
_estm4 = Ols(ym,x_0,x[:,1])
print estm4.b.T
print _estm4.b.T
assert_almost_equal(estm4.b, estm1.b)
assert_almost_equal(estm4.yhat, estm1.yhat)
assert isinstance(estm4.yhat, ma.MaskedArray)
#masked array in one x variable, but not in y
x_0 = x[:,0].copy() # is copy necessary?
x_0[0] = np.nan
x_0 = ma.masked_array(x_0, np.isnan(x_0))
estm5 = Ols(y,x_0,x[:,1]) #, addconst=False)
_estm5 = _ini.Ols(y,x_0,x[:,1]) #, addconst=False)
print estm5.b.T
print _estm5.b.T
assert_almost_equal(estm5.b, estm1.b)
assert_almost_equal(estm5.yhat, estm2.yhat)
assert isinstance(estm5.yhat, ma.MaskedArray)
#assert np.all(estm5.yhat == estm1.yhat)
#example polynomial
print 'example with one polynomial x added'
estv = Ols(y,np.vander(x[:,0],3), x[:,1], addconst=False)
_estv = _ini.Ols(y,np.vander(x[:,0],3), x[:,1], addconst=False)
print estv.b.T
print _estv.b.T
print estv.yhat
print _estv.yhat
assert not isinstance(estv.yhat, ma.MaskedArray)
m = cookbook_example()
More information about the SciPy-User
mailing list