Comparing 2 similar strings?
Steve Holden
steve at holdenweb.com
Fri May 20 15:31:26 EDT 2005
Chris Croughton wrote:
> On Thu, 19 May 2005 06:38:59 +1000, John Machin
> <sjmachin at lexicon.net> wrote:
>
>
>>On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <morton at lsupcaemnt.com>
>>wrote:
>>
>>
>>>William Park wrote:
>>>
>>>
>>>>How do you compare 2 strings, and determine how much they are "close" to
>>>>each other? Eg.
>>>> aqwerty
>>>> qwertyb
>>>>are similar to each other, except for first/last char. But, how do I
>>>>quantify that?
>>>
>>>"However you like" is probably the right answer, but one way might be to
>>>compare their soundex encoding
>>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out
>>>percentage difference based on comparing the numeric part.
>>
>>Fantastic suggestion. Here's a tiny piece of real-life test data:
>>
>>compare the surnames "Mousaferiadis" and "McPherson".
>
>
> I remember a word processor, MultiMate, which used Soundex to do
> matching for its suggestions for spelling correction. One of my
> cow-orkers typed the word 'agains' -- it was supposed to be 'against'
> but 'again' would also have been a sensible suggestion. MultiMate,
> however, suggested neither of those reasonable words, it did suggest
> "iguanas" amd "Utahns"...
>
> (I wonder what it does with "Talliafero" and "Tolliver", and with
> "Featherstone-Howe" and "Fanshaw"...)
>
> The answer to the OP, which someone just pointed out to me on
> comp.programming, is "edit distance" or "Levenshtein Distance"[1]. A
> google search on either produces some good descriptions in the first few
> links, including http://www.merriampark.com/ld.htm which has not only a
> description of the algorithm but also source code in Java, C++ and
> Visual Basic (no Awk, thought there are links to pages with
> implementations in Perl, Python, Delphi and many more)...
>
> [1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein'
> in Schwaebisch (south German) fashion, but apparently the author of the
> algorithm was one Vladimir I. Levenshtein and that is how he is credited
> on the IEEE site. There are also a number of Google hits under the
> 'stein' spelling, though...
>
> Chris C
And what's the Levenshtein distance between "levenshtein" and
"levenstein"? Ironic if it were large ...
regards
Steve
--
Steve Holden +1 703 861 4237 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
More information about the Python-list
mailing list