Comparing 2 similar strings?

John Machin sjmachin at lexicon.net
Thu May 19 18:39:44 EDT 2005


On Fri, 20 May 2005 01:47:15 +1000, Steven D'Aprano
<steve at REMOVETHIScyber.com.au> wrote:

>On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote:
>
>> None of the other approaches make the mistake of preserving the first
>> letter -- this alone is almost enough reason for jettisoning soundex.
>
>Off-topic now, but you've made me curious.
>
>Why is this a bad idea?
>
>How would you handle the case of "barow" and "marow"? (Barrow and
>marrow, naturally.) Without the first letter, they sound identical. Why is
>throwing that information away a good thing?

Sorry if that was unclear. By "preserving the first letter", I meant
that in "standard" soundex, the first letter is not transformed into a
digit.

Karen -> K650
Kieran -> K650
(R->6, N->5; vowels->0, and the zeros are then squeezed out)

Now compare this:
Aaron -> A650
Erin -> E650

Bearing in mind that the usual application of soundex is "all or
nothing", the result is Karen == Kieran, but Aaron != Erin, which is
at the very least extremely inconsistent.
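
To make the inconsistency concrete, here is a minimal sketch of "standard" (American) soundex in Python. The names CODES and soundex are my own, not from any particular library, and the details (H and W not separating equal codes, padding to four characters) follow the usual textbook description rather than any one implementation.

```python
# Letter groups of textbook American soundex; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character soundex key (illustrative sketch only)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]              # kept as-is: this is the problem
    digits = []
    prev = CODES.get(first, "")     # first letter's code still suppresses repeats
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "HW":          # H and W do not separate equal codes
            prev = code
    return (first + "".join(digits) + "000")[:4]


for pair in (("Karen", "Kieran"), ("Aaron", "Erin")):
    keys = [soundex(n) for n in pair]
    print(pair, keys, "match" if keys[0] == keys[1] else "no match")
```

The vowels inside the name vanish from the key, but a vowel in first position survives untouched, so the second pair cannot match.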

A better phonetic-key creator would produce the same key for both
names in the first pair, and likewise for both names in the second
pair -- e.g. KARAN and ARAN respectively.
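
One way to get keys like those, purely as a sketch: treat the first letter exactly like every other letter, fold each run of vowels to a single "A", and squeeze doubled consonants. The function name vowel_folded_key and the choice to count Y as a vowel are assumptions of mine, not a published algorithm.

```python
VOWELS = set("AEIOUY")  # assumption: Y is treated as a vowel here


def vowel_folded_key(name):
    """Illustrative key: vowel runs become one 'A', doubled letters collapse."""
    key = []
    for ch in name.upper():
        if not ch.isalpha():
            continue
        if ch in VOWELS:
            ch = "A"
        # Skip a letter that repeats the previous key character.
        if not key or key[-1] != ch:
            key.append(ch)
    return "".join(key)


for name in ("Karen", "Kieran", "Aaron", "Erin", "Barrow", "Marrow"):
    print(name, "->", vowel_folded_key(name))
```

Note that the first letter is not thrown away, only transformed by the same rules as the rest: Barrow and Marrow still get different keys. This sketch does nothing about consonants that sound alike, which a real key creator would also need to handle.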

Also consider Catherine vs Katherine: soundex gives C365 and K365, so
the two spellings never match.

Cheers,
John



