getting a submatrix of all true

John Hunter jdhunter at ace.bsd.uchicago.edu
Wed Jul 2 15:16:57 EDT 2003


I have a largish data set (1000 observations x 100 floating point
variables), and some of the data are missing.  I want to try a
variety of clustering, neural network, and other algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix
so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.

I am currently using a hack that works, but it makes me wonder if
there is an optimal solution.  I define optimal as the removal of rows
and columns such that there are no missing values and
max(numRows*numCols).
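
To make this definition of optimal concrete, here is a minimal
brute-force sketch. It assumes NumPy (not the Numeric module used later
in this post), and the name best_submatrix_bruteforce is hypothetical.
For each subset of columns it keeps every row that has no missing value
in those columns, so it is exact but exponential in the number of
columns and only practical for small matrices, not the 100-variable
data set. The matrix is the 5 x 5 example given later in this post.
Since the kept rows and columns need not be contiguous, the search
reports a 16-element submatrix for it (drop the first row and the third
column), one element more than the 15-element candidate.

```python
from itertools import combinations

import numpy as np


def best_submatrix_bruteforce(missing):
    """Return (area, rows, cols) maximizing rows * cols with no True left.

    Exhaustive over column subsets, so only usable for small matrices.
    """
    missing = np.asarray(missing, dtype=bool)
    n_cols = missing.shape[1]
    best_area, best_rows, best_cols = 0, [], []
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            cols = list(cols)
            # For a fixed column set, the best row set is every row
            # with no missing value in those columns.
            rows = np.flatnonzero(~missing[:, cols].any(axis=1))
            area = len(rows) * k
            if area > best_area:
                best_area, best_rows, best_cols = area, rows.tolist(), cols
    return best_area, best_rows, best_cols


# 1 marks a missing value (the example matrix from this post).
X = np.array([[1, 0, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]])

area, rows, cols = best_submatrix_bruteforce(X)
print(area, rows, cols)
```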

My current approach is to drop rows (observations) that have more than
some prespecified number of missing variables, and then drop the
columns (variables) of the reduced data set that have any missing
values.  I choose the threshold for dropping a row by eyeballing the
distribution of the number of missing variables per observation,
picking a number on the low end of the distribution, and dropping the
rows that exceed it.
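
Rather than eyeballing the threshold, one option is to try every
possible threshold and keep the one that leaves the most elements. The
sketch below assumes NumPy, and the function names are hypothetical. It
is still the same heuristic (rows first, then columns), so it is not
guaranteed to find the global optimum.

```python
import numpy as np


def reduce_by_threshold(missing, max_missing):
    """Keep rows with at most max_missing True values, then keep the
    columns that have no True value in the remaining rows."""
    missing = np.asarray(missing, dtype=bool)
    rows = np.flatnonzero(missing.sum(axis=1) <= max_missing)
    cols = np.flatnonzero(~missing[rows].any(axis=0))
    return rows, cols


def best_threshold(missing):
    """Sweep all thresholds; return (area, threshold, rows, cols)
    for the threshold giving the largest rows * cols."""
    missing = np.asarray(missing, dtype=bool)
    best = (0, None, None, None)
    for t in range(int(missing.sum(axis=1).max()) + 1):
        rows, cols = reduce_by_threshold(missing, t)
        area = len(rows) * len(cols)
        if area > best[0]:
            best = (area, t, rows, cols)
    return best


# 1 marks a missing value (the example matrix from this post).
X = np.array([[1, 0, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]])

area, threshold, rows, cols = best_threshold(X)
print(area, threshold, rows, cols)
```

On the example matrix the thresholds 0, 1 and 2 leave 15, 16 and 10
elements respectively, so the sweep picks a threshold of 1.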

Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left?
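
One heuristic that fits this formulation is greedy removal: repeatedly
drop the row or column that removes the most True values per element
lost, until no True values remain. The sketch below assumes NumPy, and
greedy_reduce is a hypothetical name. It is a heuristic only and makes
no claim to return the optimal submatrix in general.

```python
import numpy as np


def greedy_reduce(missing):
    """Greedily drop rows/columns until no True value is left.

    Returns the indices of the rows and columns that are kept.
    """
    missing = np.asarray(missing, dtype=bool)
    rows = np.arange(missing.shape[0])
    cols = np.arange(missing.shape[1])
    while True:
        sub = missing[np.ix_(rows, cols)]
        if not sub.any():
            break
        # Dropping a row loses len(cols) elements; dropping a column
        # loses len(rows) elements. Score = True values removed per
        # element lost.
        row_gain = sub.sum(axis=1) / len(cols)
        col_gain = sub.sum(axis=0) / len(rows)
        if row_gain.max() >= col_gain.max():
            rows = np.delete(rows, row_gain.argmax())
        else:
            cols = np.delete(cols, col_gain.argmax())
    return rows, cols


# Small illustrative matrix; 1 marks a True (missing) entry.
X = np.array([[1, 0, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]])

rows, cols = greedy_reduce(X)
print(rows, cols, len(rows) * len(cols))
```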


Example:
   
  0 0 0 
  0 0 0 candidate sub matrix has 12 elements
  0 0 0       
  0 0 0 

1 0 0 0 1
0 0 0 0 0    0 0 0 0 0    
0 0 0 0 0    0 0 0 0 0  candidate submatrix has 15 elements
0 0 0 0 0    0 0 0 0 0 
0 0 1 0 0   

      0 0
      0 0 candidate submatrix has 8 elements
      0 0 
      0 0 

I want to programmatically extract the 15 element matrix.

Following the approach described above, I get the desired answer in
the example below, though this is a hack solution and I have the
feeling there is a better one.

    from Numeric import nonzero, array, take, sum

    X = array([[1, 0, 0, 0, 1],
               [0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 1, 0, 0]])

    goodObsInd = nonzero(sum(X,1)<2)  # observations with < 2 missing variables
    X = take(X, goodObsInd)           # drop the bad

    goodVarInd = nonzero(sum(X)==0)   # variables with no missing data
    X = take(X, goodVarInd, 1 )       # drop the bad variables

    print X
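
For readers without the Numeric module, a rough equivalent of the
snippet above can be written with NumPy. This is a sketch under that
assumption, and the variable names are changed for illustration.
np.flatnonzero stands in for Numeric's nonzero, and plain indexing
stands in for take.

```python
import numpy as np

# 1 marks a missing value.
X = np.array([[1, 0, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]])

# Observations (rows) with fewer than 2 missing variables.
good_obs = np.flatnonzero(X.sum(axis=1) < 2)
X_rows = X[good_obs]                     # drop the bad observations

# Variables (columns) with no missing data in the remaining rows.
good_vars = np.flatnonzero(X_rows.sum(axis=0) == 0)
X_clean = X_rows[:, good_vars]           # drop the bad variables

print(X_clean)
```

With the threshold of 2 used here, only the first row and the third
column are dropped, so the result is a 4 x 4 block of zeros.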


John Hunter




