Best match searching

Brian Kelley bkelley at wi.mit.edu
Tue Dec 30 12:17:16 EST 2003


Daniel Pryde wrote:
> The array is implemented simply using a list of lists. The only solution 
> I could think of is to use some for statements, but that seems to take a 
> while, even for just one search.
Ahh this brings back memories of my image processing days :)

I've implemented a couple of approaches.
1) using just one list to hold the data
2) using a list of lists
3) using numeric

I've optimized the search function for 1 and 2 a little bit by defining 
the scoring function on the fly like this and binding input during the 
function definition:

input = 0.5
def euclidean(x, input=input): return abs(x-input)

We can also use a bounding box technique such that if a minimum x is 
found and the x < input then we can reject all points less than x 
without having to perform the euclidean distance.  Conversely if x > 
input then we can reject all points greater than x.  This really only 
works if x is a single value and not a vector.

Don't forget that when looking for minimums using a euclidean distance 
you don't have to perform the square root.  (x1*x1+x2*x2) will find the 
correct minimum as well as opposed to the more time consuming 
math.sqrt(x1*x1+x2*x2)

Here are the results looking for data closest to an input value of 0.5 
using a pretty loaded pentium 4 2ghz.  The result is for a single search 
against a random array of 600x400.

time           row col value
0.130000114441 396 328 0.500001351171 single list
0.119999885559 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix

So numeric is the clear winner here.  However, when psyco is used 
http://psyco.sourceforge.net/

time            row col value
0.0400000810623 396 328 0.500001351171 single list
0.0499999523163 396 328 0.500001351171 list of lists
0.0600000619888 396 328 0.500001351171 matrix

psyco actually wins here but more interesting is that the timing order 
is reversed!  I'll have to do a lot more trials to generate good 
profiling, but I think the results are pretty good here.

Here is the test code, feel free to rip it apart as you will.
import Numeric, random, time

# create data
data = [random.random() for x in xrange(600*400)]

# test list of list
lists = []
for i in range(400): # rows
     lists.append(data[i*600:(i+1)*600]) # columns

matrix = Numeric.array(lists)

def testarray(data=data):
     t1 = time.time()
     # straight forward approach
     mn = 10e99
     index = None
     lx = -10e99 # keep track of the best found
     hx = 10e99  # approaching from the bottom and top

     input = 0.5
     def euclidean(x, input=input):
         return abs(x-input)

     for i,x in enumerate(data):
         if x < lx or x > hx: # can we reject this point?
             continue
         y = euclidean(x)
         if y < mn:
             if x < input:
                 lx = x
             else:
                 hx = x
             mn = y
             index = i

     t2 = time.time()
     row = index/600
     col = index%600
     print t2-t1, row, col, data[index], "single list"

def testlists(list=list):
     mn = 10e99
     index = None
     input = 0.5
     lx = -10e99 # keep track of the best found
     hx = 10e99  # approaching from the bottom and top

     def euclidean(x, input=input):
         return abs(x-input)
     t1 = time.time()
     for r, row in enumerate(lists):
         for c, x in enumerate(row):
             if x < lx or x > hx: # can we reject this point?
                 continue
             y = euclidean(x)
             if y < mn:
                 if x < input:
                     lx = x
                 else:
                     hx = x
                 mn = y
                 index = r,c

     t2 = time.time()
     r,c = index
     print t2-t1, r,c, lists[r][c], "list of lists"

def testmatrix(matrix=matrix):
     t1 = time.time()
     mins = abs(matrix-0.5)
     mn = 10e99
     index = None
     for r,c in enumerate(Numeric.argmin(mins)):
         y = mins[r,c]
         if y < mn:
             mn = y
             index = r,c
     t2 = time.time()
     r,c = index
     print t2-t1, r,c, lists[r][c], "matrix"

# list of lists approach
testarray()
testlists()
testmatrix()

print "*"*44
print "try with psyco"
try:
     import psyco
     psyco.full()
     testarray()
     testlists()
     testmatrix()
except ImportError:
     print "psyco not available"






More information about the Python-list mailing list