Soundex implementation (was: RE: Fuzzy string matching?)

Tim Peters tim_one at email.msn.com
Tue Aug 31 21:28:08 EDT 1999


[Tim, on Soundex]
>| I find Knuth's description of the algorithm ambiguous in several
>| respects,

[John Machin]
> I see no ambiguity here; certainly, it is very hard to see anything
> so ambiguous as to give you the leeway to treat W and H as different
> from Y! Please elaborate.

That part isn't ambiguous.  I don't know which person you thought of when I
wrote "Knuth", but the guy I'm thinking of makes a special case of W and H (but
not Y) in step 3:

   3. If two or more letters with the same code were adjacent in
      the original name (before step 1), or adjacent except for
      intervening h's and w's, omit all but the first.

If you can read that as saying W and H are the same as Y, the ambiguities are
even worse than I had suspected <wink>.

>| and there are many incompatible soundex algorithms "out
>| there".

> So Python should have yet another incompatible soundex algorithm?
> I didn't think me-too-ism was the Python way. My preference would
> be to implement what's in Knuth's book, subject to resolution of
> the ambiguity question.

I did, to the best of my understanding.  If your understanding differs, write
an implementation that matches yours, and we can claw each other to death over
which is the True Knuthian Algorithm <wink>.

>| soundex-is-a-pragmatic-hack-so-right-vs-wrong-is-
>| fuzzy-ly y'rs - tim

> I agree entirely but I don't think your WY hack improves the result
> --- please supply some counter-examples if you have any ---

As above, it was Knuth's "hack", but I don't think it's hard to see the
motivation:  vowels between similar consonants generally "break up" the latter
aurally, but in English "h" and "w" generally aren't voiced intraword while "y"
generally is (and Knuth is treating all of "why" as vowels in step 1).  An
example is his own Wachs -> W200.  Here c and s both map to 2, but the step-3
rule correctly predicts that the "h" between them "can't be heard".  OTOH,
Wacys is more likely to sound like Wacis or Wakoz:  an intraword vowel is
generally voiced in English, and then W220 is the proper code.
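
To make the Wachs/Wacys contrast concrete, here is a minimal Python sketch of
the step-3 rule as quoted above.  It is one reading of the rule, not a
definitive implementation.  Only step 3 is quoted in this post, so the
letter-to-digit table and the pad-or-truncate-to-four-characters step are
assumptions taken from the commonly cited form of the algorithm, and the names
CODES and soundex_sketch are made up for illustration.

```python
# Assumed letter-to-digit table (not quoted in this post): the commonly
# cited Soundex groups.  Letters absent from the table (a, e, i, o, u, y,
# h, w) carry no code.
CODES = {}
for digit, group in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in group:
        CODES[letter] = str(digit)


def soundex_sketch(name):
    """Return a 4-character code under one reading of the step-3 rule."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Code of the most recent letter that was not h or w.  It is None after
    # a vowel or y, so an equal code following a vowel is kept.
    prev = CODES.get(first)
    for ch in letters[1:]:
        if ch in "hw":
            # Step 3: h and w are transparent, so they do not separate
            # two letters that share a code.
            continue
        code = CODES.get(ch)
        if code is not None and code != prev:
            digits.append(code)
        prev = code
    # Assumed final step: pad with zeros or truncate to letter + 3 digits.
    return (first.upper() + "".join(digits) + "000")[:4]
```

With this sketch, soundex_sketch("Wachs") gives "W200" because the h is skipped
and the s collapses into the c, while soundex_sketch("Wacys") gives "W220"
because the y resets the previous code just as a vowel would.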

> but can have a detrimental effect with a few Welsh names; see below.

Like Uncle Don gives a hoot about Wales <wink>.

saw-a-hebrew-variant-of-soundex-once-that-slobbered-on-for-
    pages-ly y'rs  - tim





