[SciPy-user] Stepwise Discriminant Analysis
josef.pktd at gmail.com
josef.pktd at gmail.com
Wed Mar 4 00:49:30 EST 2009
On Tue, Mar 3, 2009 at 6:49 PM, <josef.pktd at gmail.com> wrote:
> On Tue, Mar 3, 2009 at 4:43 PM, Nuttall, Brandon C <bnuttall at uky.edu> wrote:
>> Hello,
>>
>> Is a stepwise discriminant analysis routine available in Python? Can someone point me in the right direction?
>>
>
> I don't know of any stepwise implementation and searching with google
> I find mostly articles with the message:
>
> "Stepwise Descriptive or Predictive Discriminant Analysis: Don't Even
> Think about Using It!"
>
> is the title of:
> http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED438321&ERICExtSearch_SearchType_0=no&accno=ED438321
>
> or see http://epm.sagepub.com/cgi/content/refs/55/4/525
>
> There are several python packages on machine learning and classifier
> analysis that might do something similar to what you want.
> e.g. http://www.pymvpa.org/classifiers.html#basic-supervised-learning-methods
> or logistic regression in nipy neuroimaging stats.models
> FDANode in http://mdp-toolkit.sourceforge.net/tutorial.html#node-list
> and there are several others.
>
> However, I don't know how easy it is, to use just the discriminant
> analysis without the overhead of learning an entire package, and for
> just a linear discriminant analysis they might be overkill.
>
> Josef
>
Just a clarification: Statisticians are mostly critical of many
stepwise statistical methods, among other reasons, because the usually
reported statistics are not correct if explanatory variables are
selected, and because if the selection of included explanatory
variables is based on in-sample fit, then some variables might capture
idiosyncratic features of the sample which is not useful for
out-of-sample prediction. It also creates problems for interpretation
of the results of the statistical analysis.
On the other hand, machine learning literature doesn't care much about
the interpretation of the estimated parameters, compared to for
example social sciences. So they can throw everything in the pot and
use cross validation to keep or select those features that help in
prediction.
An interesting comparison is in
http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/
and some of the links there.
Since I wanted to have something similar for another purpose, I wrote
a version of the linear probability model, i.e. regress a binary
variable on some continuous variables, where regressors are
sequentially included if they improve out-of-sample prediction. This
is not directly a stepwise discriminant analysis but close to it and
since it uses out-of-sample prediction similar to cross-validation, it
avoids the problem of overfitting, it doesn't have any test
statistics. However, a statistician would recommend to use a logit or
probit model instead.
This was quickly written and only minimally tested but it seems to
work pretty well for the generated sample.
(I guess this went a bit off-topic.)
Josef
-------------- next part --------------
import numpy as np
from numpy import linalg
def gendata(nreg, nobs, noise_sig):
'''generate sample data
This function only generates independent regressors and not regressors
with collinearity.
Returns
-------
y : array (nobs, 1) of binary data
xdata : array (nobs, nreg) of continuous regressor data
'''
nobsf = float(nobs)
xdata = np.hstack((np.ones((nobs,1)), np.random.randn(nobs,nreg)))
beta = np.array([[1, 1, 2, -1, 0.1, 0]+[0.05]*(nreg-5)]).T
noise = noise_sig * np.random.randn(nobs,1)
ylatent = np.dot(xdata,beta) + noise
y = (ylatent > 0).astype(int)
return xdata, y
def estimate(xsample, ysample, xout, yout, full=0):
'''estimate linear regression on sample data
returns percent of correct predictions for out-of-sample data
'''
betaest = linalg.lstsq(xsample,ysample)[0]
yout_latent_est = np.dot(xout,betaest)
yout_est = (yout_latent_est > 0.5).astype(int)
ycorrect = sum((yout_est - yout) == 0)
nyoutf = float(yout.shape[0])
if not full:
return ycorrect/nyoutf
else:
yover = sum((yout_est - yout) > 0)
yunder = sum((yout_est - yout) < 0)
return betaest, ycorrect/nyoutf*100, yover/nyoutf*100, yunder/nyoutf*100
def stepregress(xdata, y, psample=0.8):
'''regress binary random variable on a array of explanatory variables
Regressors are added sequentially, one at a time, by best improvement
in out-of-sample prediction. If no additional regressor improves the
prediction then the iteration stops
'''
nobs = y.shape[0]
nreg = xdata.shape[1]
nsample = np.floor(nobs * psample)
nout = nobs - nsample
xsample = xdata[:nsample,:]
ysample = y[:nsample,:]
xout = xdata[nsample:,:]
yout = y[nsample:,:]
idxincl = []
idxremain = range(nreg)
bestcorr = 0
oldbest = 0
while len(idxremain)>0:
bestidx = 0
print 'remaining', idxremain
for xidx in idxremain:
idxuse = idxincl + [xidx]
correctpercent = estimate(xsample[:,idxuse], ysample, xout[:,idxuse], yout)
print idxuse, correctpercent
if correctpercent > bestcorr:
bestidx = xidx
bestcorr = correctpercent
if bestcorr > oldbest:
idxincl.append(bestidx)
idxremain.remove(bestidx)
oldbest = bestcorr
print 'added: ', bestidx, 'new correct prediction', bestcorr
else:
print 'no improvement found'
break
return idxincl, idxremain
if __name__ == '__main__':
nreg = 10 #I tried 5, 10 and 20
nobs = 100 #I tried 100, 500 and 1000
noise_sig = 0.1 #standard deviation of noise of latent y variable
psample = 0.8 #fraction of data to include in estimation
xdata, y = gendata(nreg, nobs, noise_sig)
idxincl, idxremain = stepregress(xdata, y, psample=psample)
print '\nSummary'
print 'nobs=%d, nreg=%d, noise_std=%4.2f, psample=%3.2f'% (nobs, nreg,
noise_sig, psample)
print 'included regressors', idxincl
print 'excluded regressors', idxremain
#prediction, classification on the full data set with regressors selected
# by stepregress:
betaest, ycorrect, yover, yunder = estimate(xdata[:,idxincl], y,
xdata[:,idxincl], y, full=1)
print 'percent label 1 in data', sum(y)/float(nobs)*100
print 'ycorrect, yover, yunder (%) =', ycorrect, yover, yunder
print 'estimated parameters for regressors', idxincl
print betaest.T
More information about the SciPy-User
mailing list