getting a submatrix of all true

Jon wright at esrf.fr
Fri Jul 4 14:16:41 EDT 2003


bokr at oz.net (Bengt Richter) wrote in message news:<be0b84$89a$0 at 216.39.172.122>...
> On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jdhunter at ace.bsd.uchicago.edu> wrote:
> >I am currently using a hack that works, but it makes me wonder if
> >there is an optimal solution.  I define optimal as the removal of rows
> >and columns such that there are no missing values and
> >max(numRows*numCols).

There must be! For each missing entry you can choose to drop either
the row or the column - giving a maximum of 2^N possible ways for N
missing elements. Given that the order in which the rows/columns are
deleted is unimportant the number of possibilities to test is less
than that, but the difficulty is that your decision about row or
column for one element affects the decision for other elements if they
share rows or columns. Graph theoretic approaches probably help - it's
similar to reordering matrix elements to avoid fill in when
decomposing sparse matrices. Depending on N, you might be able to test
all possibilities in a reasonable time, but I have a nasty feeling you
need to check all 2^M to be sure of an optimal solution, where M is
the subset of N which either a share a row or column with another
problematic element.

> >Another way of formulating the question: for a sparse boolean matrix
> >(sparse on True), what is the optimal way to remove rows and columns
> >so that the total number of elements in the matrix is maximal and
> >there are no True values left.

Sadly I can only give you a method which is certainly not optimal, but
at least finds something which is probably not too bad, and fairly
quickly. Uses the Numeric module to speed up, so that a few thousand
by a few hundred is quickly computable. I'm curious to know if it does
much better or worse than other algorithms.

Interesting problem!

Cheers,

Jon

=====================================================================
from Numeric import *

def pickrcs(a,debug=0):
   b=array(a,copy=1)      # local copy of array is modified
   dr=zeros(b.shape[0])   # Arrays to note deleted rows...
   dc=zeros(b.shape[1])   # ... and columns
   sizeleft=[b.shape[0],b.shape[1]]  # Remaining dimensions
   while 1:               # Keep deleting till none left    
      sr=sum(b,1)         # Column scores = ij's to remove 
      sc=sum(b,0)         # Row scores
      mr=argmax(sr)       # index highest row score
      mc=argmax(sc)       # index highest column score
      if sr[mr]==0 and sc[mc]==0 : 
         break            # stop when nothing to delete
      # pick the row/column to delete fewest useful elements
      if sizeleft[1]-sr[mr] > sizeleft[0]-sc[mc]:   
         b[:,mc]=0        # Zero out column
         dc[mc]=1         # tags column as deleted
         sizeleft[1]-=1
      else:   
         b[mr,:]=0        # Zero out row
         dr[mr]=1
         sizeleft[0]-=1
   # end of deletions loop - should be no missing elements now
   #
   # paranoia!, check if anything was left after deletions
   if sum(dr)<b.shape[0] and sum(dc)<b.shape[1]: 
      dr=compress(dr==0,range(b.shape[0]))  # gives return format of  
      dc=compress(dc==0,range(b.shape[1]))  # optimrcs posted by Bengt
      score=dr.shape[0]*dc.shape[0]         # Richter
      return score,dr,dc
   else:
      print "Sorry, didn't manage to let anything survive"
      return 0,0,0

# test code

X = [
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0]
]

if __name__ == '__main__':
    a=array(X)
    score, rowsel, colsel = pickrcs(a)
    print "score=",score
    for r in range(5):
        row = X[r]
        for c in range(5):
            if r in rowsel and c in colsel:
                print '<%s>'% row[c],
            else: 
                print ' %s '% row[c],
        print
    # now try for a larger problem - use a random pattern
    import RandomArray,time
    start=time.time()
    x0=1000  # problem size
    x1=600
    a=RandomArray.random((x0,x1))
    a=where(a>0.999,1,0)                # ~ 0.1% of elements are 1
    print "Using a large random matrix"
    print "Number of non-zeros=",sum(ravel(a)),"in",x0,"x",x1,"matrix"
    score, rowsel, colsel = pickrcs(a)
    print 'Score = %s' % (score)
    print 'Number of rows deleted=',a.shape[0]-rowsel.shape[0]
    print 'Number of columns deleted=',a.shape[1]-colsel.shape[0]
    print time.time()-start," seconds" 



==sample output (of course second part is random!)===============

score= 16
 1   0   0   0   1
<0> <0>  0  <0> <0>
<0> <0>  0  <0> <0>
<0> <0>  0  <0> <0>
<0> <0>  1  <0> <0>
Using a large random matrix
Number of non-zeros= 607 in 1000 x 600 matrix
Score = 331776
Number of rows deleted= 424
Number of columns deleted= 24
3.38999998569  seconds




More information about the Python-list mailing list