getting a submatrix of all true

Mon Jul 7 18:47:31 EDT 2003

>>>>> "John" == John Hunter <jdhunter at ace.bsd.uchicago.edu> writes:

    John> I think I may adapt your algorithm to terminate when the
    John> percent missing falls below some threshold, removing the
    John> requirement that *all* missing values be removed.  This will
    John> drop the worst offenders observation or variable wise.  Then
    John> I can use a regression approach to fill in the remaining
    John> ones.

I think I have found a major improvement on your algorithm Jon, by
changing the rule which determines whether to drop a row or a column
after you have found the worst offenders for each.  The original
approach was to keep the one with the most valid (zero) elements,
which makes sense since that is the quantity you are trying to
maximize.  But consider the case where there is correlation so that
true values on a row or column tend to be correlated with other true
values on that row or column, as in

X = 1 0 
    1 0
    1 0
    1 0
    1 0
    0 0 
    0 0

Your algorithm would keep the column, since keeping a row gives you
only one zero whereas keeping a column gives you 2.  By looking at the
*fraction* of ones in the worst offending row and column, you can do
much better.  By deleting the row or column with the greatest fraction
of ones, you reduce the likelihood that you'll encounter a one in
future iterations.  Using this approach, I was able to find a 893x67
submatrix with a score of 59831 versus the 462x81 submatrix with a
score of 37422.

I have also added thresholds to the code, so that you can terminate
with a fraction of missing values remaining.  By setting the
thresholds to zero, you get the special case of no missing values.

The example below also includes a no-copy optimization that Jon
emailed me off list.  The algorithm run time for a 1073x82 matrix is
around 30ms on my 700MHz machine.

John Hunter

from __future__ import division
import sys, time
from urllib import urlopen
from Numeric import *

def trim_missing(a, thresh=(0.05,0.05) ):
    """
    a is a numObservations by numVariables boolean matrix

    thresh = (othresh, vthresh) are the observation and variable
    thresholds that trigger deletion

      remove any rows where percent missing vars > vthresh
      remove any cols where percent missing obs > othresh

    This is done iteratively rechecking the percent missing on each
    iteration or row or col removal, with the largest offender removed
    on each iteration

    return value is score, rowInd, colInd where the two ind arrays
    are the indices of the rows and cols that were kept

    """
    othresh, vthresh = thresh # theshold for deletion
    dr=zeros(a.shape[0])   # Arrays to note deleted rows...
    dc=zeros(a.shape[1])   # ... and columns
    sizeleft=list(a.shape) # Remaining dimensions
    sr=sum(a,1)         # Column scores = ij's to remove 
    sc=sum(a,0)         # Row scores
    while 1:               # Keep deleting till none left    
        mr=argmax(sr)       # index highest row score
        mc=argmax(sc)       # index highest column score
        fracRow = sr[mr]/sizeleft[1]
        fracCol = sc[mc]/sizeleft[0]

        if fracRow<=vthresh and fracCol<=othresh: 
            break            # stop when missing criteria are satisfied
        if fracRow-vthresh<fracCol-othresh:
            sr=sr-a[:,mc]    # rows reduced by contents of column
            sc[mc]=0         # sum of column now zero
            dc[mc]=1         # tags column as deleted
            sizeleft[1]-=1
        else:   
            # deleting the row
            sc=sc-a[mr,:]    # columns reduced by contents of row
            sr[mr]=0         # sum of row is now zero
            dr[mr]=1         # tags row as deleted
            sizeleft[0]-=1

    score=(a.shape[0]-sum(dr))*(a.shape[1]-sum(dc))
    dr=compress(dr==0,range(a.shape[0]))  # gives return format of  
    dc=compress(dc==0,range(a.shape[1]))  # optimrcs posted by Bengt

    return score, dr, dc

url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
L = [ [ int(val) for val in line.split()] 
      for line in urlopen(url).readlines()]
a= array(L)
start = time.time()
score, goodRows, goodCols = trim_missing(a, thresh=(0.0, 0.0))
end = time.time()
print end-start
print 'Original size', a.shape
print 'Number of non-zeros in', sum(ravel(a))
print 'Reduced size', len(goodRows), len(goodCols)
print 'Score', score