Comparing 2 similar strings?

Wed May 18 22:07:07 EDT 2005

On Wed, 18 May 2005 20:03:53 -0500, Ed Morton <morton at lsupcaemnt.com>
wrote:

>
>
>John Machin wrote:
>> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <morton at lsupcaemnt.com>
>> wrote:
>> 
>> 
>>>
>>>William Park wrote:
>>>
>>>
>>>>How do you compare 2 strings, and determine how much they are "close" to
>>>>each other?  Eg.
>>>>    aqwerty
>>>>    qwertyb
>>>>are similar to each other, except for first/last char.  But, how do I
>>>>quantify that?
>>>>
>>>>I guess you can say for the above 2 strings that
>>>>    - at max, 6 chars out of 7 are same sequence --> 85% max
>>>>
>>>>But, for
>>>>    qawerty
>>>>    qwerbty
>>>>max correlation is
>>>>    - 3 chars out of 7 are the same sequence --> 42% max
>>>>
>>>>(Crossposted to 3 of my favourite newsgroups.)
>>>>
>>>
>>>"However you like" is probably the right answer, but one way might be to 
>>>compare their soundex encoding 
>>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
>>>percentage difference based on comparing the numeric part.
>>>
>> 
>> 
>> Fantastic suggestion. Here's a tiny piece of real-life test data:
>> 
>> compare the surnames "Mousaferiadis" and "McPherson".
>> 
>
>Fantastic test data set. I know how to pronounce McPherson but I'd never 
>have guessed that Mousaferiadis sounds like it.

If you guessed "moose a ferry ah dis", i.e. phonetically, you wouldn't
be far wrong. The point is that the two names neither look similar nor
sound similar. It is highly unlikely that one would be corrupted into
the other during either written or spoken communication. However, they
get the same soundex code because the soundex method picks out MSFR
and MCPR and says in effect that S==C (sometimes) and F==P
(sometimes).
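
To make the collision concrete, here is a minimal sketch of the usual
American Soundex rules (keep the first letter, map the remaining
consonants to digits, collapse adjacent equal digits, pad to four
characters). The function name soundex() and the CODES table are my own
illustration, not taken from any particular library, and real-world
implementations differ in small details.

```python
# Sketch of standard American Soundex; names here are illustrative only.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code for name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")  # the first letter's digit also collapses
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal digits
        code = CODES.get(ch, "")  # vowels (and Y) have no digit
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated digit is kept
    return (result + "000")[:4]


for surname in ("Mousaferiadis", "McPherson", "MacPherson"):
    print(surname, soundex(surname))  # all three print M216
```

S and C both map to 2, and F and P both map to 1, so everything after
the initial M reduces to the same three digits.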

>
>I assume you were actually being facetious and trying to make the point 
>that names that don't look the same on paper can have the same soundex 
>encoding and that's obviously countered with the fact that soundex is 
>just a cheap and cheerful way to find names that probably sound similar 
>which can vary tremendously based on ethnicity or accent.

*If* you want phonetic similarity, there are methods that are much
better than soundex, in the sense of fewer false positives and fewer
false negatives. Google for NYSIIS, dolby, metaphone, caverphone.

Cheap? You get what you pay for.

Cheerful? What's the relevance?

Someone who types "Mousaferiadis" into a customer search screen and
gets back several lines of McPherson and MacPherson is unlikely to be
cheerful -- even before we factor in the speed [soundex divides the
universe into a relatively small number of buckets].

Someone who's looking for Erin when they should be looking for Aaron
(or vice versa) won't get much cheer out of soundex, either.

>
>It's a reasonable approach to consider given the very loose requirements 
>presented.

Soundex is *NEVER* a reasonable approach to consider. Phonetic
variation is only one consideration. In any case, the OP didn't appear
to be concerned with phonetic variations.
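
For what it's worth, the measure the OP sketched (longest run of
characters common to both strings, divided by the string length) can be
approximated with the standard library's difflib. This is only a rough
sketch of that one measure, not a recommendation; the function names
longest_common_run() and run_similarity() are made up for illustration,
and dividing by the longer string's length is my assumption for strings
of unequal length.

```python
from difflib import SequenceMatcher


def longest_common_run(a, b):
    """Return the longest contiguous block of characters found in both strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def run_similarity(a, b):
    """Length of the longest common run divided by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return len(longest_common_run(a, b)) / longest


print(longest_common_run("aqwerty", "qwertyb"))        # qwerty
print(round(run_similarity("aqwerty", "qwertyb"), 2))  # 0.86
print(longest_common_run("qawerty", "qwerbty"))        # wer
print(round(run_similarity("qawerty", "qwerbty"), 2))  # 0.43
```

These reproduce the OP's 6-out-of-7 and 3-out-of-7 figures. The same
measure gives "Mousaferiadis" and "McPherson" a score well under 0.2.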