Comparing 2 similar strings?
Rob W. W. Hooft
news at rob.hooft.net
Sun May 29 00:42:42 EDT 2005
Irmen de Jong wrote:
> Wouldn't it be easier if it is normalized to a float between 0.0 and 1.0?
Maybe. You could do that by ignoring "negative" values, and by dividing
by min(len(s1),len(s2))+1. For my application this is irrelevant, I only
need a scale to compare a single word to many different words, and pick
the one that stands out. I can relate the (unnormalized) score to the
number of typos in the word.
> Furthermore, I can also produce wrong(?) results:
>
> $ python comparestrings.py
> s1: test
> s2: x
> Score: -0.4
>
> Minus 0.4... Is this a bug?
Not in the procedure, it is a bug in my description. It was late at
night when I wrote it. What you are doing here is aligning the two words
like this:
test
--x-
the t opens a gap (score -0.3), e elongates the gap (-0.2), s and x are
adjacent on the keyboard (score 0.4) and t opens a gap (-0.3). Total
score is -0.4. If you compare "test" with "l" you even get a score of
-0.7. You can make arbitrarily low numbers by comparing long strings of
"a" characters with a single "l".
What this is telling you is that not only are the letters all different
(score 0.0), but the strings are not even equally long (hence <0.0). The
gap-open and gap-elongation penalties can be adjusted in the module. The
-0.3 and -0.2 values were set after some experimentation to make the
relative scores of different matches "feel" natural.
--
Rob W.W. Hooft || rob at hooft.net || http://www.hooft.net/people/rob/
More information about the Python-list
mailing list