getting a submatrix of all true
Jon
wright at esrf.fr
Fri Jul 4 14:16:41 EDT 2003
bokr at oz.net (Bengt Richter) wrote in message news:<be0b84$89a$0 at 216.39.172.122>...
> On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jdhunter at ace.bsd.uchicago.edu> wrote:
> >I am currently using a hack that works, but it makes me wonder if
> >there is an optimal solution. I define optimal as the removal of rows
> >and columns such that there are no missing values and
> >max(numRows*numCols).
There must be! For each missing entry you can choose to drop either
the row or the column - giving a maximum of 2^N possible ways for N
missing elements. Given that the order in which the rows/columns are
deleted is unimportant the number of possibilities to test is less
than that, but the difficulty is that your decision about row or
column for one element affects the decision for other elements if they
share rows or columns. Graph theoretic approaches probably help - it's
similar to reordering matrix elements to avoid fill in when
decomposing sparse matrices. Depending on N, you might be able to test
all possibilities in a reasonable time, but I have a nasty feeling you
need to check all 2^M to be sure of an optimal solution, where M is
the subset of N which either a share a row or column with another
problematic element.
> >Another way of formulating the question: for a sparse boolean matrix
> >(sparse on True), what is the optimal way to remove rows and columns
> >so that the total number of elements in the matrix is maximal and
> >there are no True values left.
Sadly I can only give you a method which is certainly not optimal, but
at least finds something which is probably not too bad, and fairly
quickly. Uses the Numeric module to speed up, so that a few thousand
by a few hundred is quickly computable. I'm curious to know if it does
much better or worse than other algorithms.
Interesting problem!
Cheers,
Jon
=====================================================================
from Numeric import *
def pickrcs(a,debug=0):
b=array(a,copy=1) # local copy of array is modified
dr=zeros(b.shape[0]) # Arrays to note deleted rows...
dc=zeros(b.shape[1]) # ... and columns
sizeleft=[b.shape[0],b.shape[1]] # Remaining dimensions
while 1: # Keep deleting till none left
sr=sum(b,1) # Column scores = ij's to remove
sc=sum(b,0) # Row scores
mr=argmax(sr) # index highest row score
mc=argmax(sc) # index highest column score
if sr[mr]==0 and sc[mc]==0 :
break # stop when nothing to delete
# pick the row/column to delete fewest useful elements
if sizeleft[1]-sr[mr] > sizeleft[0]-sc[mc]:
b[:,mc]=0 # Zero out column
dc[mc]=1 # tags column as deleted
sizeleft[1]-=1
else:
b[mr,:]=0 # Zero out row
dr[mr]=1
sizeleft[0]-=1
# end of deletions loop - should be no missing elements now
#
# paranoia!, check if anything was left after deletions
if sum(dr)<b.shape[0] and sum(dc)<b.shape[1]:
dr=compress(dr==0,range(b.shape[0])) # gives return format of
dc=compress(dc==0,range(b.shape[1])) # optimrcs posted by Bengt
score=dr.shape[0]*dc.shape[0] # Richter
return score,dr,dc
else:
print "Sorry, didn't manage to let anything survive"
return 0,0,0
# test code
X = [
[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]
]
if __name__ == '__main__':
a=array(X)
score, rowsel, colsel = pickrcs(a)
print "score=",score
for r in range(5):
row = X[r]
for c in range(5):
if r in rowsel and c in colsel:
print '<%s>'% row[c],
else:
print ' %s '% row[c],
print
# now try for a larger problem - use a random pattern
import RandomArray,time
start=time.time()
x0=1000 # problem size
x1=600
a=RandomArray.random((x0,x1))
a=where(a>0.999,1,0) # ~ 0.1% of elements are 1
print "Using a large random matrix"
print "Number of non-zeros=",sum(ravel(a)),"in",x0,"x",x1,"matrix"
score, rowsel, colsel = pickrcs(a)
print 'Score = %s' % (score)
print 'Number of rows deleted=',a.shape[0]-rowsel.shape[0]
print 'Number of columns deleted=',a.shape[1]-colsel.shape[0]
print time.time()-start," seconds"
==sample output (of course second part is random!)===============
score= 16
1 0 0 0 1
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 1 <0> <0>
Using a large random matrix
Number of non-zeros= 607 in 1000 x 600 matrix
Score = 331776
Number of rows deleted= 424
Number of columns deleted= 24
3.38999998569 seconds
More information about the Python-list
mailing list