Comparing 2 similar strings?

Steve Holden steve at holdenweb.com
Fri May 20 15:31:26 EDT 2005


Chris Croughton wrote:
> On Thu, 19 May 2005 06:38:59 +1000, John Machin 
>    <sjmachin at lexicon.net> wrote:
> 
> 
>>On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <morton at lsupcaemnt.com>
>>wrote:
>>
>>
>>>William Park wrote:
>>>
>>>
>>>>How do you compare 2 strings, and determine how much they are "close" to
>>>>each other?  Eg.
>>>>    aqwerty
>>>>    qwertyb
>>>>are similar to each other, except for first/last char.  But, how do I
>>>>quantify that?
>>>
>>>"However you like" is probably the right answer, but one way might be to 
>>>compare their soundex encoding 
>>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
>>>percentage difference based on comparing the numeric part.
>>
>>Fantastic suggestion. Here's a tiny piece of real-life test data:
>>
>>compare the surnames "Mousaferiadis" and "McPherson".
> 
> 
> I remember a word processor, MultiMate, which used Soundex to do
> matching for its suggestions for spelling correction.  One of my
> cow-orkers typed the word 'agains' -- it was supposed to be 'against'
> but 'again' would also have been a sensible suggestion.  MultiMate,
> however, suggested neither of those reasonable words, it did suggest
> "iguanas" amd "Utahns"...
> 
> (I wonder what it does with "Talliafero" and "Tolliver", and with
> "Featherstone-Howe" and "Fanshaw"...)
> 
> The answer to the OP, which someone just pointed out to me on
> comp.programming, is "edit distance" or "Levenshtein Distance"[1].  A
> google search on either produces some good descriptions in the first few
> links, including http://www.merriampark.com/ld.htm which has not only a
> description of the algorithm but also source code in Java, C++ and
> Visual Basic (no Awk, thought there are links to pages with
> implementations in Perl, Python, Delphi and many more)...
> 
> [1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein'
> in Schwaebisch (south German) fashion, but apparently the author of the
> algorithm was one Vladimir I. Levenshtein and that is how he is credited
> on the IEEE site.  There are also a number of Google hits under the
> 'stein' spelling, though...
> 
> Chris C

And what's the Levenshtein distance between "levenshtein" and 
"levenstein"? Ironic if it were large ...

regards
  Steve
-- 
Steve Holden        +1 703 861 4237  +1 800 494 3119
Holden Web LLC             http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/




More information about the Python-list mailing list