Comparing 2 similar strings?

Sun May 29 00:42:42 EDT 2005

Irmen de Jong wrote:
> Wouldn't it be easier if it is normalized to a float between 0.0 and 1.0?

Maybe. You could do that by ignoring "negative" values, and by dividing 
by min(len(s1),len(s2))+1. For my application this is irrelevant, I only 
need a scale to compare a single word to many different words, and pick 
the one that stands out. I can relate the (unnormalized) score to the 
number of typos in the word.

> Furthermore, I can also produce wrong(?) results:
> 
> $ python comparestrings.py
> s1: test
> s2: x
> Score: -0.4
> 
> Minus 0.4... Is this a bug?

Not in the procedure, it is a bug in my description. It was late at 
night when I wrote it. What you are doing here is aligning the two words 
like this:

test
--x-

the t opens a gap (score -0.3), e elongates the gap (-0.2), s and x are 
adjacent on the keyboard (score 0.4) and t opens a gap (-0.3). Total 
score is -0.4. If you compare "test" with "l" you even get a score of 
-0.7. You can make arbitrarily low numbers by comparing long strings of 
"a" characters with a single "l".

What this is telling you is that not only are the letters all different 
(score 0.0), but the strings are not even equally long (hence <0.0). The 
gap-open and gap-elongation penalties can be adjusted in the module. The 
-0.3 and -0.2 values were set after some experimentation to make the 
relative scores of different matches "feel" natural.

-- 
Rob W.W. Hooft  ||  rob at hooft.net  ||  http://www.hooft.net/people/rob/