[Doc-SIG] soundex module status?

Fred L. Drake, Jr. <fdrake@acm.org>
Tue, 29 Jun 1999 15:44:32 -0400 (EDT)



Tim Peters writes:
 > Oh, you people are so charmingly naive <wink>.  There are at least a dozen
 > algorithms *called* "Soundex" out there, and unless you're eager to dig into

Tim,
  Some of us strive very hard to reach our nirvana of naivete!  ;-)  It
wasn't that long ago that I added the reference to Knuth to the
documentation, and I had already forgotten slogging through his
history of the approach.

 > 1) Soundex certainly doesn't deserve to be a std C module!  It makes an OK
 > demo of a Python extension, though.

  When I spoke with Guido about soundex, maybe a year ago, and what we 
should do about it, his comment wasn't that different from yours.
Having a Python implementation made sense to support JPython, so I
cooked one up.  Once that was done, Guido promptly rejected it because
he didn't want to add new code (I think he was close to a release at
the time), even though I strove to make it match the existing module
in both results and interface.

 > 2) If Skip promises to take it over, I'll attach a Python implementation of
 > Knuth's version of Soundex.  This isn't the same algorithm as the current

  And I'll attach my version as well, since it's compatible with the
existing module.


  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives


[Attachment: soundex2.py (soundex module in Python)]

"""The soundex algorithm takes an English word, and returns an
easily-computed hash of it; this hash is intended to be the same for
words that sound alike.  This module provides an interface to the
soundex algorithm.

Note that the soundex algorithm is quite simple-minded, and isn't
perfect by any measure.  Its main purpose is to help looking up names
in databases, when the name may be misspelled -- soundex hashes common
misspellings together.
"""

def get_soundex(word):
    """Return the soundex hash value for a word; it will always be a
    6-character string.  `word' must contain the word to be hashed,
    with no leading whitespace; the case of the word is ignored.  (Note
    that the original algorithm produces a 4-character result.)"""

    # Fold to upper case so the letter tests below ignore case.
    s = word.upper()
    if not s:
        return '000000'
    r = s[0]
    s = s[1:]
    while len(r) < 6 and s:
        c = s[0]
        s = s[1:]
        if c in "WHAIOUY":
            pass
        elif c in "BFPV":
            if r[-1] != '1':
                r = r + '1'
        elif c in "CGJKQSXZ":
            if r[-1] != '2':
                r = r + '2'
        elif c in "DT":
            if r[-1] != '3':
                r = r + '3'
        elif c == "L":
            if r[-1] != '4':
                r = r + '4'
        elif c in "MN":
            if r[-1] != '5':
                r = r + '5'
        elif c == "R":
            if r[-1] != '6':
                r = r + '6'
    return r + '0' * (6 - len(r))


def sound_similar(s1, s2):
    """Returns true if both arguments have the same soundex code."""
    return get_soundex(s1) == get_soundex(s2)

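
The letter-to-digit rules in the attachment can also be written as a lookup
table.  What follows is only a sketch under that assumption, not part of the
posted module: the names `_CODES` and `soundex6` are hypothetical, and the
code is meant to reproduce the attachment's 6-character results (first
character kept as-is, vowels and W/H/Y skipped, a digit dropped when it
equals the last character emitted), not Knuth's 4-character version.

```python
# Hypothetical, table-driven restatement of the attached get_soundex().
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex6(word):
    """Return a 6-character code using the same rules as the attachment."""
    s = word.upper()
    if not s:
        return "000000"
    r = s[0]                     # first character is kept verbatim
    for c in s[1:]:
        if len(r) == 6:          # code is full; ignore the rest of the word
            break
        digit = _CODES.get(c)    # None for vowels, W, H, Y and non-letters
        # Append only when the digit differs from the last character emitted.
        if digit is not None and r[-1] != digit:
            r += digit
    return r.ljust(6, "0")       # pad short codes with zeros


if __name__ == "__main__":
    print(soundex6("Robert"))    # R16300
    print(soundex6("Rupert"))    # R16300
```

Because skipped letters do not reset the comparison, two letters with the
same digit still collapse when only vowels separate them; that matches the
attachment's behaviour, which differs from some other algorithms that go by
the name Soundex.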