Percentage matching of text

Dan Bishop danb_83 at yahoo.com
Sat Jul 31 05:03:51 EDT 2004


Bruce Eckel <BruceEckel at MailBlocks.com> wrote in message news:<mailman.958.1091195562.5135.python-list at python.org>...
> Background: for the 4th edition of Thinking in Java, I'm trying to
> once again improve the testing scheme for the examples in the book. I
> want to verify that the output I show in the book is "reasonably
> correct." I say "Reasonably" because a number of examples produce
> random numbers or text or the time of day or in general things that do
> not repeat themselves from one execution to the next. So, much of the
> text will be the same between the "control sample" and the "test
> sample," but some of it will be different.
> 
> I will be using Python or Jython for the test framework.
> 
> What I'd like to do is find an algorithm that produces the results of
> a text comparison as a percentage-match. Thus I would be able to
> assert that my test samples must match the control sample by at least
> (for example) 83% for the test to pass. Clearly, this wouldn't be a
> perfect test but it would help flag problems, which is primarily what
> I need.
> 
> Does anyone know of an algorithm or library that would do this? Thanks
> in advance.

One of the simpler ones is to calculate the length of the longest
common subsequence of the test output and the control output.

def lcsLength(seqA, seqB):
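   # lenTable[i][j] holds the LCS length of seqA[:i] and seqB[:j].
   # The extra row and column of zeros stand for the empty prefixes, so
   # the lookups below never wrap around to the far end of the table
   # (and empty inputs simply give 0).
   lenTable = [[0] * (len(seqB) + 1) for i in range(len(seqA) + 1)]
   for i, a in enumerate(seqA):
      for j, b in enumerate(seqB):
         if a == b:
            lenTable[i+1][j+1] = lenTable[i][j] + 1
         else:
            lenTable[i+1][j+1] = max(lenTable[i][j+1], lenTable[i+1][j])
   return lenTable[-1][-1]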

To convert this to a percentage value, divide by the length of the
control output and multiply by 100.
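
As a rough sketch of that conversion (the names lcs_length and
percent_match are made up for this illustration, and the handling of an
empty control output is an assumption, since the division is undefined
there), something like this should work. It keeps only one row of the
table at a time, which is enough when only the length is needed:

```python
def lcs_length(seq_a, seq_b):
    """Length of the longest common subsequence of two sequences."""
    # prev[j] is the LCS length of the items of seq_a seen so far
    # and seq_b[:j]; index 0 stands for the empty prefix.
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b):
            if a == b:
                curr.append(prev[j] + 1)
            else:
                curr.append(max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1]


def percent_match(control, test):
    """LCS length as a percentage of the control output's length."""
    if not control:
        # Assumption: an empty control matches only an empty test.
        return 100.0 if not test else 0.0
    return 100.0 * lcs_length(control, test) / len(control)


# Hypothetical sample outputs that differ only in a timing value.
control = "Elapsed: 0.123 seconds"
test = "Elapsed: 0.456 seconds"
print(percent_match(control, test))
```

Here 19 of the 22 control characters survive, so the score is a bit
over 86%. Note that because only the control length is in the
denominator, extra text in the test output is not penalized; dividing
by the longer of the two lengths instead would be one way around that.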

Btw, thank you for those footnotes in Thinking in Java that encouraged
me to try Python :-)


