[SciPy-user] Ols for np.arrays and masked arrays

josef.pktd at gmail.com
Mon Jan 19 15:30:57 EST 2009


On Mon, Jan 19, 2009 at 11:53 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> Hi,
> I do not want to sound overly critical as I would like to assist with this.
>
> I am not sure what your design goal is, and what the Regression
> class should actually contain and do.  Do you want something like an R lm object?
>

After our previous discussion, I was thinking about what a more useful
interface for regression analysis could look like (without using a
full formula framework).
The ols estimation part was just a quick example of a regression. The
main purpose of this is to find a good interface.

My basic idea was to create a general regression class that does the
initialization and the common calculations, while the specific estimation
is implemented in subclasses, e.g. OLS, GLS, Ridge, ... (?). For the
basic statistics, which are calculated on demand, I'm planning
something like http://www.scipy.org/Cookbook/OLS. I would also like to
get a similar wrapper for non-linear least squares, optimize.leastsq,
because that doesn't produce the (normalized) parameter standard
errors, and for a non-linear maximum likelihood estimator. Also, I don't
want to duplicate work in stats.models.
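
Roughly, the split I have in mind looks like the toy code below (just a
sketch, not the attached implementation; the Ridge subclass is only there
to show that another estimator could reuse the same setup):

import numpy as np
from scipy import linalg

class Regression(object):
    '''common setup; the actual estimation is left to subclasses'''
    def __init__(self, y, x):
        self.y = np.asarray(y, float).reshape(-1, 1)
        x = np.asarray(x, float)
        self.design = np.column_stack((np.ones(self.y.shape[0]), x))
        self.estimate()                      # delegate to the subclass

    def estimate(self):
        raise NotImplementedError

class Ols(Regression):
    def estimate(self):
        # model-specific part: plain least squares
        self.b = linalg.lstsq(self.design, self.y)[0]

class Ridge(Regression):
    def estimate(self, lam=1.0):
        # model-specific part: penalized least squares, same interface
        xtx = np.dot(self.design.T, self.design)
        self.b = linalg.solve(xtx + lam*np.eye(xtx.shape[0]),
                              np.dot(self.design.T, self.y))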

> But, I do not like that your _init_ function does so much work that
> probably does not belong there. I would have thought that it would just
> initialize certain important variables, including the solutions (b). One
> reason is that perhaps you just want to update the input arrays, but this
> design forces you to create a new object.
>
> At what stage should you check for valid inputs and the correct
> dimensions of x?

I think this is the main point of the __init__ function: to transform
the data from a convenient specification for the user to the form that
is used by the estimation procedures. If the design matrix is not
already a "clean" array, then it needs to be copied at least once
before it is handed to linalg (as far as I understand this).
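
For example, the cleaning step amounts to something like this sketch
(clean_design is a made-up helper name here; the attached __init__ does
the same job with masked arrays):

import numpy as np
import numpy.ma as ma

def clean_design(y, *args):
    '''hypothetical helper: stack regressors, mask nans, drop bad rows'''
    y = ma.fix_invalid(y)           # nans become masked, existing masks are kept
    if y.ndim == 1:
        y = y.reshape(-1, 1)
    design = ma.column_stack([ma.fix_invalid(x) for x in args])
    # a row is dropped if y or any regressor is masked/nan in that row
    bad = (ma.getmaskarray(y) | ma.getmaskarray(design)).any(axis=-1)
    valid = ~bad
    # plain ndarrays go to linalg; the mask is kept to rebuild yhat later
    return y.data[valid], design.data[valid], bad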

>
> I would also prefer the object to have the standard errors of the
> solutions and eventually other 'useful' statistics like sum of squares
> and R-squared. However, I do not know how to get the standard errors using
> linalg.lstsq (I vaguely recall there is another way that would work).
> The others are probably easy to get on demand, like R's summary function.

My current thinking is that some basic statistics, such as the estimated
parameters, standard errors, p-values and R^2, are returned immediately.
More in-depth analyses of the residuals are returned on demand in a
special result structure (class) (?).
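
One option for the on-demand part (only a sketch; OlsResults and its
caching are made up here, the dw formula is the one from the attached
code) would be to compute the extra diagnostics lazily:

import numpy as np

class OlsResults(object):
    '''hypothetical result holder; computes diagnostics only when asked'''
    def __init__(self, model):
        self.model = model        # an estimated Ols instance with residuals in .e
        self._cache = {}
    def durbin_watson(self):
        if 'dw' not in self._cache:
            e = self.model.e
            de = np.diff(e, axis=0)
            self._cache['dw'] = np.dot(de.T, de) / np.dot(e.T, e)
        return self._cache['dw']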

Just to see how it works, I copied the cookbook ols recipe into my ols
class; I had to adjust the dimensions (1D to 2D columns). It looks ok,
but I didn't test more than making sure it runs and looks reasonable.
Several of the test functions can go into the regression class, since
they also apply to other regression models, not just OLS.

The current version takes one or several arrays as inputs. Inputs can
be 1D or 2D, and can be either numpy arrays or masked arrays. All
observations with at least one masked or nan value are removed from
the regression. Other array subclasses are not yet included.
Predicted values, called yhat, are masked arrays if the input was
masked or contained nans. If the inputs are plain numpy arrays with finite
values, then the predicted value array, yhat, is also a plain numpy array.
(I haven't checked the copied part yet, e.g. whether the error vector is
computed on the compressed values.)
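
In terms of usage, this is the kind of behaviour the checks at the bottom
of the attached file exercise (made-up data, Ols as defined in the
attachment):

import numpy as np
import numpy.ma as ma

y = 2.0 + np.random.randn(20, 1)
x = np.random.randn(20, 2)

est = Ols(y, x)                    # plain ndarrays in, constant added by default
print est.b.T                      # estimated parameters
assert not isinstance(est.yhat, ma.MaskedArray)    # plain ndarray yhat out

ym = y.copy()
ym[0, 0] = np.nan                  # one bad observation
estm = Ols(ym, x)                  # row 0 is dropped from the regression
assert isinstance(estm.yhat, ma.MaskedArray)       # yhat masked where y was nan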

I attached the updated version. lm in R looks a bit like an umbrella
function, but I have not used it enough to tell which useful features
should be copied by a regression class. Generalized linear models will
be in stats.models.

Josef
-------------- next part --------------
import numpy as np
import numpy.ma as ma
import numpy.random as mtrand #import randn, seed
import matplotlib.pylab as plt
import time
from scipy import linalg, stats
#from numpy.testing import assert_almost_equal
from numpy.ma.testutils import assert_almost_equal

import regressionanalysis_ma2 as _ini

class Regression(object):
    def __init__(self,y,*args,**kwds):
        '''
        Parameters
        ----------

        y: array, 1D or 2D
            dependent variable of the regression. If 2D, then it needs to be a single column
        args: sequence of 1D or 2D arrays
            one or several arrays of regressors. If 2D, then the interpretation
            is that each column represents one variable and rows are
            observations
        kwds: addconst (default: True)
            if True, then a constant column is added to the regressors

        Returns
        -------
        class instance with regression results

        Notes
        -----

        Observations (rows) that contain masked values in a masked array or nans
        in any of the regressors or in y will be dropped from the calculation.
        Arrays that correspond to observations, e.g. estimated y (yhat)
        or residuals, are returned as masked arrays if masked values or nans are
        in the input data.

        Usage
        -----
        estm = Ols(y, x1, x2)
        estm.b  # estimated parameter vector
        
        example polynomial
        estv = Ols(y, np.vander(x[:,0],3), x[:,1], addconst=False)
        estv.b  # estimated parameter vector
        estv.yhat # fitted values
        
        '''
        self.addconst = kwds.pop('addconst', True)
        self.y_varnm = kwds.pop('y_varnm','y')
        self.x_varnm = kwds.pop('x_varnm',[])
        if not isinstance(self.x_varnm,list):
            self.x_varnm = list(self.x_varnm)
        # Make sure that any NaN in y and args are masked, and force masked arrays
        y = ma.fix_invalid(y)
        if y.ndim == 1:
            y.shape = (-1, 1)
        design = list(args)
        #design = [ma.fix_invalid(x) for x in args]
        # We need a constant: add a column of 1
        if self.addconst:
            design.insert(0, np.ones(y.shape))
            self.x_varnm.insert(0, 'const')
            

        # stack first, then fix; saves one copy, maybe not
        design = ma.column_stack(design)  # convert types and make copy
        # does it make a copy if args is a single 2D array?
        design = ma.fix_invalid(design, copy=False)

        
        # Find the masked elements
        (ymask, designmask) = (ma.getmaskarray(y), ma.getmaskarray(design))
        mask = ma.mask_or(ymask, designmask, shrink=False).any(axis=-1)
        valid = ~mask
        # And now, drop them
        self.y = y.data[valid,:]
        self.design = design.data[valid]
        self.x = self.design    # copy to use cookbook ols
        self.mask = mask
        self.ymasked = ma.array(y, mask=mask, keep_mask=True)

        for ii in range(len(self.x_varnm), self.design.shape[1]+1):
            self.x_varnm.append('var%02d' % ii)
        #
        #self.yhat = None
        self.estimate()


    def estimate(self):
        pass

    def get_yhat(self):
        pass
        

    def summary(self):
        pass

    def predict(self, x):
        pass


class Ols(Regression):
    def __init__(self,y,*args,**kwds):
        Regression.__init__(self, y, *args, **kwds)

    def estimate(self):
        (y, x) = self.y, self.design
        #print 'y.shape, x.shape'
        #print y.shape, x.shape
        (self.b, self.resid, rank, self.sigma) = linalg.lstsq(x, y)
        #
##        yhat = ma.empty_like(self.ymasked)
##        mask = self.mask
##        if mask is not ma.nomask:
##            yhat[~mask, :] = np.dot(self.design, self.b)
##        else:
##            yhat[:,:] = np.dot(self.design, self.b)
        #self.yhat = yhat
        self.yhat = self.predict()
        #print rank

        # basic statistics for the estimated coefficients
        self.inv_xx = linalg.pinv(np.dot(self.x.T,self.x)) # Moore-Penrose pseudoinverse of X'X
        xy = np.dot(self.x.T,self.y)                   # X'y (not used here; b comes from lstsq above)
        #print self.b.shape   # column vector
        self.nobs = self.y.shape[0]                     # number of observations
        self.ncoef = self.x.shape[1]                    # number of coef.
        self.df_e = self.nobs - self.ncoef              # degrees of freedom, error
        self.df_r = self.ncoef - 1                      # degrees of freedom, regression

        self.e = self.y - np.dot(self.x,self.b)            # residuals
        self.sse = np.dot(self.e.T,self.e)/self.df_e         # estimated error variance (SSE/df_e)
        self.se = np.sqrt(np.diagonal(self.sse*self.inv_xx))[:,np.newaxis]  # coef. standard errors
        self.t = self.b / self.se                       # coef. t-statistics
        self.p = (1-stats.t.cdf(np.abs(self.t), self.df_e)) * 2    # coef. p-values

        self.R2 = 1 - self.e.var()/self.y.var()         # model R-squared
        self.R2adj = 1-(1-self.R2)*((self.nobs-1.)/(self.nobs-self.ncoef))   # adjusted R-squared

        self.F = (self.R2/self.df_r) / ((1-self.R2)/self.df_e)  # model F-statistic
        self.Fpv = 1-stats.f.cdf(self.F, self.df_r, self.df_e)  # F-statistic p-value

    def dw(self):
        """
        Calculates the Durbin-Watson statistic
        """
        de = np.diff(self.e,1,axis=0)
        dw = np.dot(de.T,de) / np.dot(self.e.T,self.e)

        return dw

    def omni(self):
        """
        Omnibus test for normality
        """
        return stats.normaltest(self.e)

    def JB(self):
        """
        Calculate residual skewness, kurtosis, and do the JB test for normality
        """

        # Calculate residual skewness and kurtosis
        skew = stats.skew(self.e)
        kurtosis = 3 + stats.kurtosis(self.e)

        # Calculate the Jarque-Bera test for normality
        JB = (self.nobs/6.) * (np.square(skew) + (1/4.)*np.square(kurtosis-3))
        JBpv = 1-stats.chi2.cdf(JB,2)

        return JB, JBpv, skew, kurtosis

    def ll(self):
        """
        Calculate model log-likelihood and two information criteria
        """

        # Model log-likelihood, AIC, and BIC criterion values
        ll = -(self.nobs/2.)*(1+np.log(2*np.pi)) - \
             (self.nobs/2.)*np.log(np.dot(self.e.T,self.e)/self.nobs)
        aic = -2*ll/self.nobs + (2.*self.ncoef/self.nobs)
        bic = -2*ll/self.nobs + (self.ncoef*np.log(self.nobs))/self.nobs

        return ll, aic, bic

    def summary(self):
        """
        Printing model output to screen
        """

        # local time & date
        t = time.localtime()

        # extra stats
        ll, aic, bic = self.ll()
        JB, JBpv, skew, kurtosis = self.JB()
        omni, omnipv = self.omni()

        # printing output to screen
        print '\n=============================================================================='
        print "Dependent Variable: " + self.y_varnm
        print "Method: Ordinary Least Squares"
        print "Date: ", time.strftime("%a, %d %b %Y",t)
        print "Time: ", time.strftime("%H:%M:%S",t)
        print '# obs:           %5.0f' % self.nobs
        print '# variables:     %5.0f' % self.ncoef
        print '=============================================================================='
        print 'variable     coefficient     std. Error    t-statistic       prob.'
        print '=============================================================================='
        for i in range(self.ncoef):
            #print self.x_varnm[i],self.b[i],self.se[i],self.t[i],self.p[i]
            print '''% -5s          % 9.4f     % 9.4f     % 9.4f     % 9.4f''' % tuple(
                [self.x_varnm[i],self.b.ravel()[i],self.se.ravel()[i],self.t.ravel()[i],self.p.ravel()[i]])
        print '=============================================================================='
        print 'Model stats                            Residual stats'
        print '=============================================================================='
        print 'R-squared            % 9.4f         Durbin-Watson stat  % 9.4f' % tuple([self.R2, self.dw()])
        print 'Adjusted R-squared   % 9.4f         Omnibus stat        % 9.4f' % tuple([self.R2adj, omni])
        print 'F-statistic          % 9.4f         Prob(Omnibus stat)  % 9.4f' % tuple([self.F, omnipv])
        print 'Prob (F-statistic)   % 9.4f         JB stat             % 9.4f' % tuple([self.Fpv, JB])
        print 'Log likelihood       % 9.4f         Prob(JB)            % 9.4f' % tuple([ll, JBpv])
        print 'AIC criterion        % 9.4f         Skew                % 9.4f' % tuple([aic, skew])
        print 'BIC criterion        % 9.4f         Kurtosis            % 9.4f' % tuple([bic, kurtosis])
        print '=============================================================================='

    def predict(self, x=None):
        '''return prediction
        todo: add prediction error, confidence interval'''
        if x is None:
            x = self.design
            yest = np.dot(x, self.b)
            if (self.mask is not None) and self.mask.any():
                # the None check is just a shorthand guard, not needed currently
                # (mask is not ma.nomask) is not necessary b/c mask_or used shrink=False
                output = ma.masked_all_like(self.ymasked)
                output[~self.mask, :] = yest
                # does not remove mask for unmasked
                return output
            else:
                return yest
            
        else:
            x = ma.fix_invalid(x)
            if self.addconst:
                x = ma.column_stack((ma.ones(x.shape[0]), x))
            output = ma.dot(x, self.b)
            #maybe detend 
            mask = output.mask
            if (mask is not ma.nomask) and (not mask.any()):
                #difficult to read 2 negatives
                output = output.data
        return output


def cookbook_example():
    xxsingular = False  # set to True to test a singular design matrix
    x = np.linspace(0, 15, 40)
    a,b,c = 3.1, 42, -304.2
    y_true = a*x**2 + b*x + c
    y_meas = y_true + 100.01*np.random.standard_normal( y_true.shape )
    if xxsingular:
        xx = np.c_[x**2,x,2*x,np.ones(x.shape[0])]
    else:
        xx = np.c_[x**2,x,np.ones(x.shape[0])]
    
    x_varnm = ['const', 'x1','x2','x3','x4']
    
    k = xx.shape[1]
    #m = Ols(y_meas,xx,y_varnm = 'y',x_varnm = x_varnm[:k-1],addconst = False)
    m = Ols(y_meas,xx,y_varnm = 'y',x_varnm = x_varnm[:k],addconst = False)
    m.summary()
    
    #matplotlib plotting
    
    plt.title('Linear Regression Example')
    plt.plot(x,y_true,'g.--')
    plt.plot(x,y_meas,'k.')
    plt.plot(x,y_meas-m.e.flatten(),'r.-')
    plt.legend(['original','plus noise', 'regression'], loc='lower right')
    np.testing.assert_almost_equal(y_meas[:,np.newaxis]-m.e,m.yhat)
    return m


if __name__ == '__main__':

    import numpy as np
    #from olsexample import ols

    def generate_data(nobs):
        x = np.random.randn(nobs,2)
        btrue = np.array([[5,1,2]]).T
        y = np.dot(x, btrue[1:,:]) + btrue[0,:] + 0.5 * np.random.randn(nobs,1)
        return y,x

    y,x = generate_data(15)

    #benchmark no masked arrays, and one 2D array for x
    est = Ols(y[1:,:],x[1:,:])  # initialize and estimate with ols, constant added by default
    _est = _ini.Ols(y[1:,:],x[1:,:])  # initialize and estimate with ols, constant added by default
    print 'ols estimate'
    est.estimate()
    print est.b.T
    print _est.b.T
    #print np.array([[5,1,2]])  # true coefficients

    ynew,xnew = generate_data(3)
    ypred = est.predict(xnew)

    print '    ytrue        ypred        error'
    print np.c_[ynew, ypred, ynew - ypred]
    
    ypred = _est.predict(xnew)
    print np.c_[ynew, ypred, ynew - ypred]
    assert not isinstance(est.yhat, ma.MaskedArray)
    assert not isinstance(ypred, ma.MaskedArray)


    #case masked array y
    ym = y.copy()
    ym[0,:] = np.nan
    ym = ma.masked_array(ym, np.isnan(ym))
    estm1 = Ols(ym,x)
    _estm1 = _ini.Ols(ym,x)
    print estm1.b.T
    print estm1.yhat.shape
    print _estm1.b.T
    print _estm1.yhat.shape
    print 'yhat'
    print estm1.yhat[:10,:]
    print _estm1.yhat[:10,:]
    assert_almost_equal(estm1.yhat[1:,:], est.yhat)
    assert isinstance(estm1.yhat, ma.MaskedArray)
    assert not isinstance(estm1.predict(x[:3,]), ma.MaskedArray)
    # not sure about this one

    #masked y and several x args, addconst=False
    estm2 = Ols(ym,np.ones(ym.shape),x[:,0],x[:,1],addconst=False)
    _estm2 = _ini.Ols(ym,np.ones(ym.shape),x[:,0],x[:,1],addconst=False)
    print estm2.b.T
    print _estm2.b.T
    assert_almost_equal(estm2.b, estm1.b)
    assert_almost_equal(estm2.yhat, estm1.yhat)
    assert isinstance(estm2.yhat, ma.MaskedArray)

    #masked y and several x args, 
    estm3 = Ols(ym,x[:,0],x[:,1]) 
    _estm3 = _ini.Ols(ym,x[:,0],x[:,1])
    print estm3.b.T
    print _estm3.b.T
    assert_almost_equal(estm3.b, estm1.b)
    assert_almost_equal(estm3.yhat, estm2.yhat)
    assert isinstance(estm3.yhat, ma.MaskedArray)

    #masked array in y and one x variable
    x_0 = x[:,0].copy()  # is copy necessary?
    x_0[0] = np.nan
    x_0 = ma.masked_array(x_0, np.isnan(x_0))
    estm4 = Ols(ym,x_0,x[:,1])
    _estm4 = _ini.Ols(ym,x_0,x[:,1])
    print estm4.b.T
    print _estm4.b.T
    assert_almost_equal(estm4.b, estm1.b)
    assert_almost_equal(estm4.yhat, estm1.yhat)
    assert isinstance(estm4.yhat, ma.MaskedArray)

    #masked array in one x variable, but not in y
    x_0 = x[:,0].copy()  # is copy necessary?
    x_0[0] = np.nan
    x_0 = ma.masked_array(x_0, np.isnan(x_0))
    estm5 = Ols(y,x_0,x[:,1])   #, addconst=False)
    _estm5 = _ini.Ols(y,x_0,x[:,1])   #, addconst=False)
    print estm5.b.T
    print _estm5.b.T
    assert_almost_equal(estm5.b, estm1.b)
    assert_almost_equal(estm5.yhat, estm2.yhat)
    assert isinstance(estm5.yhat, ma.MaskedArray)
    #assert np.all(estm5.yhat == estm1.yhat)

    #example polynomial
    print 'example with one polynomial x added'
    estv = Ols(y,np.vander(x[:,0],3), x[:,1], addconst=False)
    _estv = _ini.Ols(y,np.vander(x[:,0],3), x[:,1], addconst=False)
    print estv.b.T
    print _estv.b.T
    print estv.yhat
    print _estv.yhat
    assert not isinstance(estv.yhat, ma.MaskedArray)

    m = cookbook_example()

