getting a submatrix of all true

Thu Jul 3 00:25:40 EDT 2003

On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jdhunter at ace.bsd.uchicago.edu> wrote:

>
>I have a largish data set (1000 observations x 100 floating point
>variables), and some of the of the data are missing.  I want to try a
>variety of clustering, neural network, etc, algorithms on the data,
>and to keep life simple I want to reduce the dimensions of the matrix
>so that I have no missing values, since not all the algorithms are
>able to handle them and there is sufficient redundancy in the
>variables that I can afford to lose some.
>
>I am currently using a hack that works, but it makes me wonder if
>there is an optimal solution.  I define optimal as the removal of rows
>and columns such that there are no missing values and
>max(numRows*numCols).
>
>My current approach is to drop rows (observations) that have more than
>some prespecified number of missing variables, and then drop the
>columns (variables) of the reduced data set that have any missing
>values.  I chose the threshold for dropping a row by eyeballing the
>distribution of number of missing variables per observation, pick a
>number on the low end of the distribution, and dropping the rows that
>exceed the threshold.
>
>Another way of formulating the question: for a sparse boolean matrix
>(sparse on True), what is the optimal way to remove rows and columns
>so that the total number of elements in the matrix is maximal and
>there are no True values left.
>
>
>Example:
>   
>  0 0 0 
>  0 0 0 candidate sub matrix has 12 elements
>  0 0 0       
>  0 0 0 
>
>1 0 0 0 1
>0 0 0 0 0    0 0 0 0 0    
>0 0 0 0 0    0 0 0 0 0  candidate submatrix has 15 elements
>0 0 0 0 0    0 0 0 0 0 
>0 0 1 0 0   
>
>      0 0
>      0 0 candidate submatrix has 8 elements
>      0 0 
>      0 0 
>
>I want to programatically extract the 15 element matrix

If I understand your original optimality definition, that would be suboptimal.
I.e., how about the 16-element matrix? (x's mark the corresponding zeroes)

 1 0 0 0 1
 x x 0 x x
 x x 0 x x
 x x 0 x x
 x x 1 x x   

Or do they have to be adjacent?

>
>Following the approach described above, I get the desired answer in
>the example below, though this is a hack solution and I have the
>feeling there is a better one.
>
>    from Numeric import nonzero, array, take, sum
>
>    X = array([[1, 0, 0, 0, 1],
>               [0, 0, 0, 0, 0],
>               [0, 0, 0, 0, 0],
>               [0, 0, 0, 0, 0],
>               [0, 0, 1, 0, 0]])
>
>    goodObsInd = nonzero(sum(X,1)<2)  # observations with < 2 missing variables
>    X = take(X, goodObsInd)           # drop the bad
>
>    goodVarInd = nonzero(sum(X)==0)   # variables with no missing data
>    X = take(X, goodVarInd, 1 )       # drop the bad variables
>
>    print X
>

Brute force seems to work on this little example. Maybe it can
be memoized and optimized and/or whatever to handle your larger matrix fast enough?

====< submatrix.py >================================
X = [
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0]
]

def optimrcs(mx, rows=[], cols=[]):
    if not rows or not cols: return 0,[],[]
    maxscore = 0
    for r in rows:
        if not mx[r].count(1): continue
        for c in cols:
            if not mx[r][c]: continue
            # eval column submatrix vs row submatrix
            colsxthis = cols[:]; colsxthis.remove(c)
            colscore, rset, cset = optimrcs(mx, rows, colsxthis)
            if colscore>maxscore:
                maxscore, maxrows, maxcols = colscore, rset, cset 
            rowsxthis = rows[:]; rowsxthis.remove(r)
            rowscore, rset, cset = optimrcs(mx, rowsxthis, cols)
            if rowscore >= maxscore:
                maxscore, maxrows, maxcols = rowscore, rset, cset 
    if not maxscore:
        return len(rows)*len(cols), rows, cols
    return maxscore, maxrows,maxcols

if __name__ == '__main__':
    score, rowsel, colsel = optimrcs(X, range(5), range(5))
    print 'Score = %s, rows = %s, cols = %s' % (score,rowsel,colsel)
    print
    for r in range(5):
        row = X[r]
        for c in range(5):
            if r in rowsel and c in colsel:
                print '<%s>'% row[c],
            else: 
                print ' %s '% row[c],
        print
====================================================
Running it results thus:

[21:24] C:\pywk\clp>submatrix.py
Score = 16, rows = [1, 2, 3, 4], cols = [0, 1, 3, 4]

 1   0   0   0   1
<0> <0>  0  <0> <0>
<0> <0>  0  <0> <0>
<0> <0>  0  <0> <0>
<0> <0>  1  <0> <0>

Regards,
Bengt Richter