[Numpy-discussion] Optimizing similarity matrix algorithm

Kurdt Bane kurdt.bane at gmail.com
Wed Sep 12 06:48:24 EDT 2007


Er... obviously, when I wrote "scan the similarity algorithm and find all
the diagonals", I meant "scan the similarity matrix and find all the
diagonals".

"Similarity matrix" should really be called "Equality matrix", as I imagine
it as a matrix with dimensions len(a) x len(b) where M[x][y] = (a[x] ==
b[y])
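
For small inputs that matrix can even be materialized in one shot with
broadcasting; a minimal sketch, using the same a and b as in the code below:

import numpy
from sys import argv

a = numpy.fromstring(open(argv[1], 'rb').read(), 'c')
b = numpy.fromstring(open(argv[2], 'rb').read(), 'c')
M = a[:, numpy.newaxis] == b[numpy.newaxis, :]  # boolean, len(a) x len(b)

But M costs len(a) * len(b) bytes, so two ~100 KB files would already need
about 10 GB, which is why the code below never builds it explicitly.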

On 9/12/07, Kurdt Bane <kurdt.bane at gmail.com> wrote:
>
> Hi to all!
> For reverse engineering purposes, I need to find where every possible
> chunk of bytes in file A is contained in file B. Obviously, if a chunk of
> length n is contained in B, I don't want my script to also report all the
> subchunks of size < n contained in that chunk.
>
> I coded a naive implementation of a similarity algorithm: scan the
> similarity algorithm and find all the diagonals. Here's the code:
>
> import numpy
> from sys import argv
>
> file_a = open(argv[1], 'rb')   # binary mode: we compare raw bytes
> file_b = open(argv[2], 'rb')
>
> a = numpy.fromstring(file_a.read(), 'c')
> b = numpy.fromstring(file_b.read(), 'c')
>
> # a run of count+1 equal bytes is reported when count >= tolerance
> tolerance = int(argv[3])
>
> chunks = []
> valid = True
>
> for x in xrange(len(a)):
>     for y in xrange(len(b)):
>         if a[x] == b[y]:
>             count = 0
>             x_cnt, y_cnt = x, y
>             try:
>                 # walk down the diagonal while the bytes keep matching
>                 while a[x_cnt + 1] == b[y_cnt + 1]:
>                     count += 1
>                     x_cnt += 1
>                     y_cnt += 1
>             except IndexError:
>                 pass   # ran off the end of a or b
>             if count >= tolerance:
>                 # discard the run if it lies inside a chunk we already have
>                 for (x0, x1, y0, y1) in chunks:
>                     if x >= x0 and x_cnt <= x1 and y >= y0 and y_cnt <= y1:
>                         valid = False
>                         if __debug__:
>                             print "Found an already processed subchunk"
>                         break
>                 if valid:
>                     chunks.append((x, x_cnt, y, y_cnt))
>                     print "Corresponding chunk found. List 1: from %d to %d. List 2: from %d to %d" % (x, x_cnt, y, y_cnt)
>                     print "with length of %d" % (x_cnt + 1 - x)
>                 else:
>                     valid = True
>
>
> It simply scans the similarity matrix, finds the diagonals on which a[x] ==
> b[y], and interprets each diagonal as a chunk. For example, with a = "xabcy"
> and b = "zabcw", the run at (1,1), (2,2), (3,3) is reported as the chunk
> a[1:4] == b[1:4] == "abc". Each chunk found is stored in a list, which is
> also used to check whether a new chunk is a subchunk of a greater chunk
> already found.
>
> The problem is that this implementation is very slow, IMHO because of three
> factors:
>
> 1. I use a nested for loop to scan the matrix.
>
> 2. When the start of a diagonal is found, the program scans the diagonal
> with another, additional loop. Maybe it would be faster to use a function
> such as diagonal(), but I can't actually _create_ the boolean similarity
> matrix: it gets way too big for files of any interesting size (we're
> talking about ~100 KB files), so I am forced to compute it "on the way"
> (see the sketch after this list).
>
> 3. When I find a chunk, this approach still visits all of its subchunks
> and compares each of them against the "big" chunks already stored in the
> list. Maybe there's a better algorithmic way.
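>
> Here is the kind of per-diagonal scan I have in mind (a rough, untested
> sketch; find_chunks is just a name I made up):
>
> import numpy
>
> def find_chunks(a, b, tolerance):
>     # One pass per diagonal: len(a) + len(b) - 1 passes in total, each
>     # a vectorized slice comparison instead of a Python inner loop,
>     # so the full boolean matrix is never materialized.
>     chunks = []
>     for d in xrange(-(len(b) - 1), len(a)):
>         # Diagonal d holds the comparisons a[i] == b[i - d].
>         x0 = max(d, 0)
>         y0 = max(-d, 0)
>         n = min(len(a) - x0, len(b) - y0)
>         eq = a[x0:x0 + n] == b[y0:y0 + n]
>         # Locate runs of consecutive True values: after padding with
>         # False on both sides, diff gives +1 at a run's start and -1
>         # one past its end.
>         padded = numpy.concatenate(([False], eq, [False]))
>         step = numpy.diff(padded.astype(numpy.int8))
>         starts = numpy.where(step == 1)[0]
>         ends = numpy.where(step == -1)[0]
>         for s, e in zip(starts, ends):
>             # e - s is the run length; "> tolerance" matches the
>             # count >= tolerance test above (count excludes the
>             # first matching byte).
>             if e - s > tolerance:
>                 chunks.append((x0 + s, x0 + e - 1, y0 + s, y0 + e - 1))
>     return chunks
>
> Since each diagonal is scanned once and only maximal runs are kept, the
> same-diagonal subchunks from point 3 never appear at all; only runs on
> other diagonals falling inside a bigger chunk's bounding box would still
> need the containment test.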
>
> What do you think about these issues? Is there a way to optimize them? And
> are there other issues I didn't take into account?
>
> Thanks in advance,
>
> regards,
>
> Chris.
>