Comparing 2 similar strings?

Chris Croughton chris at keristor.net
Thu May 19 08:45:34 EDT 2005


On Thu, 19 May 2005 06:38:59 +1000, John Machin 
   <sjmachin at lexicon.net> wrote:

> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <morton at lsupcaemnt.com>
> wrote:
> 
>>William Park wrote:
>>
>>> How do you compare 2 strings, and determine how much they are "close" to
>>> each other?  Eg.
>>>     aqwerty
>>>     qwertyb
>>> are similar to each other, except for first/last char.  But, how do I
>>> quantify that?
>>
>>"However you like" is probably the right answer, but one way might be to 
>>compare their soundex encoding 
>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
>>percentage difference based on comparing the numeric part.
> 
> Fantastic suggestion. Here's a tiny piece of real-life test data:
> 
> compare the surnames "Mousaferiadis" and "McPherson".

I remember a word processor, MultiMate, which used Soundex to do
matching for its suggestions for spelling correction.  One of my
cow-orkers typed the word 'agains' -- it was supposed to be 'against'
but 'again' would also have been a sensible suggestion.  MultiMate,
however, suggested neither of those reasonable words, it did suggest
"iguanas" amd "Utahns"...

(I wonder what it does with "Talliafero" and "Tolliver", and with
"Featherstone-Howe" and "Fanshaw"...)

The answer to the OP, which someone just pointed out to me on
comp.programming, is "edit distance" or "Levenshtein Distance"[1].  A
google search on either produces some good descriptions in the first few
links, including http://www.merriampark.com/ld.htm which has not only a
description of the algorithm but also source code in Java, C++ and
Visual Basic (no Awk, thought there are links to pages with
implementations in Perl, Python, Delphi and many more)...

[1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein'
in Schwaebisch (south German) fashion, but apparently the author of the
algorithm was one Vladimir I. Levenshtein and that is how he is credited
on the IEEE site.  There are also a number of Google hits under the
'stein' spelling, though...

Chris C



More information about the Python-list mailing list