[SciPy-user] Stepwise Discriminant Analysis

josef.pktd at gmail.com
Wed Mar 4 00:49:30 EST 2009


On Tue, Mar 3, 2009 at 6:49 PM,  <josef.pktd at gmail.com> wrote:
> On Tue, Mar 3, 2009 at 4:43 PM, Nuttall, Brandon C <bnuttall at uky.edu> wrote:
>> Hello,
>>
>> Is a stepwise discriminant analysis routine available in Python? Can someone point me in the right direction?
>>
>
> I don't know of any stepwise implementation, and searching with google
> I mostly find articles with the message:
>
> "Stepwise Descriptive or Predictive Discriminant Analysis: Don't Even
> Think about Using It!"
>
> is the title of:
> http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED438321&ERICExtSearch_SearchType_0=no&accno=ED438321
>
> or see http://epm.sagepub.com/cgi/content/refs/55/4/525
>
> There are several python packages on machine learning and classifier
> analysis that might do something similar to what you want.
> e.g. http://www.pymvpa.org/classifiers.html#basic-supervised-learning-methods
> or logistic regression in nipy neuroimaging stats.models
> FDANode in http://mdp-toolkit.sourceforge.net/tutorial.html#node-list
> and there are several others.
>
> However, I don't know how easy it is to use just the discriminant
> analysis without the overhead of learning an entire package, and for
> just a linear discriminant analysis they might be overkill.
>
> Josef
>

Just a clarification: statisticians are mostly critical of stepwise
statistical methods for several reasons. The usually reported test
statistics are no longer correct once the explanatory variables have
been selected, and if the selection of included regressors is based on
in-sample fit, then some variables may capture idiosyncratic features
of the sample that are not useful for out-of-sample prediction.
Stepwise selection also creates problems for the interpretation of the
results of the statistical analysis.

On the other hand, the machine learning literature cares much less
about the interpretation of the estimated parameters than, for
example, the social sciences do. So one can throw everything into the
pot and use cross-validation to keep or select those features that
help in prediction.
An interesting comparison is in
http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/
and some of the links there.
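
To illustrate what "select by cross-validation" means in code, here is a
rough sketch I made up for this mail (not the attached script; cv_score,
cols and nfolds are names invented only for the example). It scores one
candidate set of regressor columns by the average out-of-fold prediction
accuracy of a linear probability model fit by OLS:

import numpy as np
from numpy import linalg

def cv_score(xdata, y, cols, nfolds=5):
    '''average out-of-fold fraction of correct 0/1 predictions for the
    regressor columns in cols (illustration only)'''
    nobs = y.shape[0]
    folds = np.array_split(np.arange(nobs), nfolds)
    scores = []
    for holdout in folds:
        # boolean mask: True for the estimation part, False for the holdout fold
        mask = np.ones(nobs, dtype=bool)
        mask[holdout] = False
        beta = linalg.lstsq(xdata[mask][:, cols], y[mask])[0]
        # predict 1 if the fitted probability exceeds 0.5
        yhat = (np.dot(xdata[~mask][:, cols], beta) > 0.5).astype(int)
        scores.append(np.mean(yhat == y[~mask]))
    return np.mean(scores)

One would then keep the candidate set with the highest score; the attached
script does something simpler with a single estimation/holdout split.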

Since I wanted to have something similar for another purpose, I wrote
a version of the linear probability model, i.e. regress a binary
variable on some continuous variables, where regressors are
sequentially included if they improve out-of-sample prediction. This
is not directly a stepwise discriminant analysis, but it is close to
it, and since it selects regressors by out-of-sample prediction,
similar to cross-validation, it avoids the problem of overfitting; it
does not provide any test statistics, however. A statistician would
recommend using a logit or probit model instead.
This was quickly written and only minimally tested, but it seems to
work pretty well for the generated sample.
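
For comparison, a logit model could be fit directly by maximum likelihood
with scipy.optimize. This is only a rough sketch made up for this mail,
not part of the attached script, and the function names are invented:

import numpy as np
from scipy import optimize

def logit_negloglike(beta, x, y):
    '''negative log-likelihood of the logit model (illustration only)'''
    p = 1.0 / (1.0 + np.exp(-np.dot(x, beta)))
    p = np.clip(p, 1e-10, 1 - 1e-10)   # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logit(x, y):
    '''maximum likelihood estimate of the logit coefficients'''
    beta0 = np.zeros(x.shape[1])
    return optimize.fmin(logit_negloglike, beta0, args=(x, y.ravel()), disp=0)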

(I guess this went a bit off-topic.)

Josef
-------------- next part --------------

import numpy as np
from numpy import linalg

def gendata(nreg, nobs, noise_sig):
    '''generate sample data

    This function only generates independent regressors and not regressors
    with collinearity.

    Returns
    -------
    xdata : array (nobs, nreg+1) of continuous regressor data, with a
            constant in the first column
    y : array (nobs, 1) of binary data

    '''
    # regressor matrix: constant plus nreg independent standard normal columns
    xdata = np.hstack((np.ones((nobs,1)), np.random.randn(nobs,nreg)))
    # true coefficients (constant plus nreg slopes), requires nreg >= 5
    beta = np.array([[1, 1, 2, -1, 0.1, 0]+[0.05]*(nreg-5)]).T
    noise = noise_sig * np.random.randn(nobs,1)
    ylatent = np.dot(xdata,beta) + noise
    y = (ylatent > 0).astype(int)
    return xdata, y

def estimate(xsample, ysample, xout, yout, full=0):
    '''estimate a linear probability model by OLS on the estimation sample

    Returns the fraction of correct predictions for the out-of-sample data.
    If full is true, returns instead the parameter estimates and the
    percentages of correct, over- and under-predictions.
    '''
    betaest = linalg.lstsq(xsample, ysample)[0]
    # predict 1 if the fitted probability exceeds 0.5
    yout_latent_est = np.dot(xout, betaest)
    yout_est = (yout_latent_est > 0.5).astype(int)
    ycorrect = np.sum(yout_est == yout)
    nyoutf = float(yout.shape[0])
    if not full:
        return ycorrect/nyoutf
    else:
        yover = np.sum((yout_est - yout) > 0)    # predicted 1, actual 0
        yunder = np.sum((yout_est - yout) < 0)   # predicted 0, actual 1
        return betaest, ycorrect/nyoutf*100, yover/nyoutf*100, yunder/nyoutf*100


def stepregress(xdata, y, psample=0.8):
    '''regress a binary random variable on an array of explanatory variables

    Regressors are added sequentially, one at a time, by best improvement
    in out-of-sample prediction. If no additional regressor improves the
    prediction, then the iteration stops.
    '''
    nobs = y.shape[0]
    nreg = xdata.shape[1]
    # split the data into an estimation sample and a holdout sample
    nsample = int(nobs * psample)
    nout = nobs - nsample
    xsample = xdata[:nsample,:]
    ysample = y[:nsample,:]
    xout = xdata[nsample:,:]
    yout = y[nsample:,:]

    idxincl = []               # indices of included regressors
    idxremain = range(nreg)    # indices of remaining candidate regressors

    bestcorr = 0
    oldbest = 0
    while len(idxremain)>0:
        bestidx = 0
        print 'remaining', idxremain
        for xidx in idxremain:
            idxuse = idxincl + [xidx]
            correctpercent = estimate(xsample[:,idxuse], ysample, xout[:,idxuse], yout)
            print idxuse, correctpercent
            if correctpercent > bestcorr:
                bestidx = xidx
                bestcorr = correctpercent
                
        if bestcorr > oldbest:
            idxincl.append(bestidx)
            idxremain.remove(bestidx)
            oldbest = bestcorr
            print 'added: ', bestidx, 'new correct prediction', bestcorr
        else:
            print 'no improvement found'
            break
    return idxincl, idxremain

if __name__ == '__main__':
    nreg = 10     #I tried 5, 10 and 20
    nobs = 100    #I tried 100, 500 and 1000
    noise_sig = 0.1  #standard deviation of noise of latent y variable
    psample = 0.8    #fraction of data to include in estimation

    xdata, y = gendata(nreg, nobs, noise_sig)

    idxincl, idxremain = stepregress(xdata, y, psample=psample)

    print '\nSummary'
    print 'nobs=%d, nreg=%d, noise_std=%4.2f, psample=%3.2f'% (nobs, nreg,
                                                             noise_sig, psample)
    print 'included regressors', idxincl
    print 'excluded regressors', idxremain

    #prediction, classification on the full data set with regressors selected
    #   by stepregress:

    betaest, ycorrect, yover, yunder = estimate(xdata[:,idxincl], y,
                                               xdata[:,idxincl], y, full=1)
    print 'percent label 1 in data', sum(y)/float(nobs)*100
    print 'ycorrect, yover, yunder (%) =', ycorrect, yover, yunder
    print 'estimated parameters for regressors', idxincl
    print betaest.T