Comparing 2 similar strings?

Ed Morton morton at lsupcaemnt.com
Wed May 18 21:03:53 EDT 2005



John Machin wrote:
> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <morton at lsupcaemnt.com>
> wrote:
> 
> 
>>
>>William Park wrote:
>>
>>
>>>How do you compare 2 strings, and determine how much they are "close" to
>>>each other?  Eg.
>>>    aqwerty
>>>    qwertyb
>>>are similar to each other, except for first/last char.  But, how do I
>>>quantify that?
>>>
>>>I guess you can say for the above 2 strings that
>>>    - at max, 6 chars out of 7 are same sequence --> 85% max
>>>
>>>But, for
>>>    qawerty
>>>    qwerbty
>>>max correlation is
>>>    - 3 chars out of 7 are the same sequence --> 42% max
>>>
>>>(Crossposted to 3 of my favourite newsgroup.)
>>>
>>
>>"However you like" is probably the right answer, but one way might be to 
>>compare their soundex encoding 
>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
>>percentage difference based on comparing the numeric part.
>>
> 
> 
> Fantastic suggestion. Here's a tiny piece of real-life test data:
> 
> compare the surnames "Mousaferiadis" and "McPherson".
> 

Fantastic test data set. I know how to pronounce McPherson but I'd never 
have guessed that Mousaferiadis sounds like it. I suppose non-Celts 
probably wouldn't be able to guess how Dalziell, Drumnadrochit, Culzean, 
Ceilidh, or Concobarh are pronounced either.

I assume you were actually being facetious and trying to make the point 
that names that don't look the same on paper can have the same soundex 
encoding. That's obviously countered by the fact that soundex is just a 
cheap and cheerful way to find names that probably sound similar, and 
how names sound can vary tremendously with ethnicity or accent.

It's a reasonable approach to consider given the very loose requirements 
presented.
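
For anyone who wants to try it, here is a rough sketch of the idea in 
Python. It assumes the common "American Soundex" rules (first letter plus 
three digits, h and w ignored, vowels separating repeated codes); other 
variants differ in the details. The names soundex() and 
soundex_similarity() are made up for this example, and the percentage 
measure is just one possible reading of "comparing the numeric part", not 
an established metric.

```python
# Sketch of American Soundex; helper names here are illustrative only.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code (letter + 3 digits) for name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev to "", so a repeat after it counts
    return (first.upper() + "".join(digits) + "000")[:4]


def soundex_similarity(a, b):
    """Percentage of the three digit positions that match (assumed metric)."""
    da, db = soundex(a)[1:], soundex(b)[1:]
    same = sum(1 for x, y in zip(da, db) if x == y)
    return 100.0 * same / 3


print(soundex("Mousaferiadis"), soundex("McPherson"))
print(soundex_similarity("Mousaferiadis", "McPherson"))
```

Both surnames encode to M216, so this measure rates them as identical, 
which is exactly the weakness being pointed out. Note that the sketch 
compares only the digits and ignores the leading letter.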

	Ed.


