[Numpy-discussion] Identifying Colinear Columns of a Matrix
Charles R Harris
charlesr.harris at gmail.com
Fri Aug 26 14:04:07 EDT 2011
On Fri, Aug 26, 2011 at 11:41 AM, Mark Janikas <mjanikas at esri.com> wrote:
> I wonder if my last statement is essentially the only answer... which I
> wanted to avoid...
>
> Should I just use combinations of the columns and try and construct the
> corrcoef() (then ID whether NaNs are present), or use the condition number
> to ID the singularity? I just wanted to avoid the whole k! algorithm.
>
> MJ
>
> -----Original Message-----
> From: numpy-discussion-bounces at scipy.org [mailto:
> numpy-discussion-bounces at scipy.org] On Behalf Of Mark Janikas
> Sent: Friday, August 26, 2011 10:35 AM
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
>
> I actually use the VIF when the design matrix can be inverted.... I do it
> the quick and dirty way as opposed to the step regression:
>
> 1. Calc the correlation matrix of the columns (w/o the intercept)
> 2. Return the diagonal of the inverse of the correlation matrix from
> step 1.
>
> Again, the problem lies in the multiple column relationship... I wouldn't
> be able to run sub regressions at all when the columns are perfectly
> collinear.
>
> MJ
>
> -----Original Message-----
> From: numpy-discussion-bounces at scipy.org [mailto:
> numpy-discussion-bounces at scipy.org] On Behalf Of Skipper Seabold
> Sent: Friday, August 26, 2011 10:28 AM
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
>
> On Fri, Aug 26, 2011 at 1:10 PM, Mark Janikas <mjanikas at esri.com> wrote:
> > Hello All,
> >
> >
> >
> > I am trying to identify columns of a matrix that are perfectly collinear.
> > It is not that difficult to identify when two columns are identical or
> > have zero variance, but I do not know how to ID when the culprit is of a
> > higher order, i.e. columns 1 + 2 + 3 = column 4. NUM.corrcoef(matrix.T)
> > will return NaNs when the matrix is singular, and LA.cond(matrix.T) will
> > provide a very large condition number, but they do not tell me which
> > columns are causing the problem. For example:
> >
> >
> >
> > zt = numpy.array([[ 1.  , 1.  , 1.  , 1.  , 1.  ],
> >                   [ 0.25, 0.1 , 0.2 , 0.25, 0.5 ],
> >                   [ 0.75, 0.9 , 0.8 , 0.75, 0.5 ],
> >                   [ 3.  , 8.  , 0.  , 5.  , 0.  ]])
> >
> >
> >
> > How can I identify that columns 0,1,2 are the issue because: column 1 +
> > column 2 = column 0?
> >
> >
> >
> > Any input would be greatly appreciated. Thanks much,
> >
>
> The way that I know to do this in a regression context for (near
> perfect) multicollinearity is VIF. It's long been on my todo list for
> statsmodels.
>
> http://en.wikipedia.org/wiki/Variance_inflation_factor
>
> Maybe there are other ways with decompositions. I'd be happy to hear about
> them.
>
> Please post back if you write any code to do this.
>
>
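Why not svd? A (numerically) zero singular value signals a linear
dependence, and the matching column of u gives its coefficients:
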
# (session assumes `from numpy import dot` and `from numpy.linalg import svd`)

In [13]: u,d,v = svd(zt)

In [14]: d
Out[14]:
array([ 1.01307066e+01, 1.87795095e+00, 3.03454566e-01,
        3.29253945e-16])

In [15]: u[:,3]
Out[15]: array([ 0.57735027, -0.57735027, -0.57735027, 0. ])

In [16]: dot(u[:,3], zt)
Out[16]:
array([ -7.77156117e-16, -6.66133815e-16, -7.21644966e-16,
        -7.77156117e-16, -8.88178420e-16])
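
The session above could be wrapped up roughly as follows. This is only a
sketch under stated assumptions: the helper name collinear_groups and the two
tolerances (rtol, coef_tol) are made up for illustration and are not part of
numpy. Only numpy.linalg.svd and numpy.flatnonzero are actual numpy calls.

```python
import numpy as np

# Rows of zt are the columns of the original design matrix (zt = matrix.T).
zt = np.array([[1.0, 1.0, 1.0, 1.0, 1.0],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.0, 8.0, 0.0, 5.0, 0.0]])


def collinear_groups(zt, rtol=1e-10, coef_tol=1e-8):
    """For each near-zero singular value, list the rows of zt that take part
    in the corresponding linear dependence.

    rtol and coef_tol are hypothetical cutoffs chosen for this example.
    """
    u, d, vt = np.linalg.svd(zt, full_matrices=False)
    # Singular values below this threshold are treated as zero.
    tol = rtol * d.max()
    groups = []
    for k in np.flatnonzero(d < tol):
        # u[:, k] holds the coefficients c with c @ zt ~= 0.
        coeffs = u[:, k]
        # Rows with a non-negligible coefficient are part of the dependence.
        groups.append(np.flatnonzero(np.abs(coeffs) > coef_tol).tolist())
    return groups


print(collinear_groups(zt))
```

For the example matrix this flags rows 0, 1 and 2 of zt, i.e. the first three
columns of the original matrix. One caveat: if more than one singular value is
zero, the vectors in u only span the null space, so a single vector may mix
several independent dependencies and the groups need further untangling.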
Chuck