[Doc-SIG] soundex module status?
Fred L. Drake, Jr. <fdrake@acm.org>
Tue, 29 Jun 1999 15:44:32 -0400 (EDT)
--d503sbm0jq
Content-Type: text/plain; charset=us-ascii
Content-Description: message body text
Content-Transfer-Encoding: 7bit
Tim Peters writes:
> Oh, you people are so charmingly naive <wink>. There are at least a dozen
> algorithms *called* "Soundex" out there, and unless you're eager to dig into
Tim,
Some of us strive very hard to reach our nirvana of naivete! ;-) It
wasn't so long ago that I added the reference to Knuth to the
documentation that I've forgotten forging through his history of the
approach taken.
> 1) Soundex certainly doesn't deserve to be a std C module! It makes an OK
> demo of a Python extension, though.
When I spoke with Guido about soundex, maybe a year ago, and what we
should do about it, his comment wasn't that different from yours.
Having a Python implementation made sense to support JPython, so I
cooked one up. Once that was done, Guido promptly rejected it because
he didn't want to add new code (I think he was close to a release at
the time), even though I strove to make it match the existing module
in both results and interface.
> 2) If Skip promises to take it over, I'll attach a Python implementation of
> Knuth's version of Soundex. This isn't the same algorithm as the current
And I'll attach my version as well, since it's compatible with the
existing module.
-Fred
--
Fred L. Drake, Jr. <fdrake@acm.org>
Corporation for National Research Initiatives
--d503sbm0jq
Content-Type: text/x-python
Content-Description: soundex module in Python
Content-Disposition: inline;
filename="soundex2.py"
Content-Transfer-Encoding: 7bit
"""The soundex algorithm takes an English word, and returns an
easily-computed hash of it; this hash is intended to be the same for
words that sound alike. This module provides an interface to the
soundex algorithm.
Note that the soundex algorithm is quite simple-minded, and isn't
perfect by any measure. Its main purpose is to help looking up names
in databases, when the name may be misspelled -- soundex hashes common
misspellings together.
"""
def get_soundex(string):
    """Return the soundex hash value for a word; it will always be a
    6-character string. `string' must contain the word to be hashed,
    with no leading whitespace; the case of the word is ignored. (Note
    that the original algorithm produces a 4-character result.)"""
    # Upper-case via the string method: the parameter is named `string',
    # so the string module cannot be used inside this function.
    s = string.upper()
    if not s:
        return '000000'
    # Keep the first character as-is; encode the rest as digits.
    r = s[0]
    s = s[1:]
    while len(r) < 6 and s:
        c = s[0]
        s = s[1:]
        if c in "WHAIOUY":
            # Skipped; any other unlisted character is ignored as well.
            pass
        elif c in "BFPV":
            # Append a digit only if it differs from the last character.
            if r[-1] != '1':
                r = r + '1'
        elif c in "CGJKQSXZ":
            if r[-1] != '2':
                r = r + '2'
        elif c in "DT":
            if r[-1] != '3':
                r = r + '3'
        elif c == "L":
            if r[-1] != '4':
                r = r + '4'
        elif c in "MN":
            if r[-1] != '5':
                r = r + '5'
        elif c == "R":
            if r[-1] != '6':
                r = r + '6'
    # Pad with zeros to exactly six characters.
    return r + '0' * (6 - len(r))


def sound_similar(s1, s2):
    """Returns true if both arguments have the same soundex code."""
    return get_soundex(s1) == get_soundex(s2)
--d503sbm0jq--
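For anyone who wants to try the rules in the attachment without saving
it first, here is a minimal table-driven sketch of what I believe is the
same logic. The names `_CODES` and `soundex6` are made up for
illustration and are not part of the existing module, and the sketch
assumes a Python that has string methods.

```python
# Illustrative sketch only: _CODES and soundex6 are hypothetical names,
# not part of the soundex module discussed above.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        _CODES[letter] = digit


def soundex6(word):
    """Return a 6-character code using the rules of the attached module."""
    word = word.upper()
    if not word:
        return "000000"
    result = word[0]
    for c in word[1:]:
        if len(result) >= 6:
            break
        digit = _CODES.get(c)
        # Unmapped characters (vowels, W, H, Y, punctuation) are skipped;
        # a digit is appended only if it differs from the last character.
        if digit is not None and result[-1] != digit:
            result = result + digit
    # Pad with zeros to exactly six characters.
    return result + "0" * (6 - len(result))


print(soundex6("Robert"))  # R16300
print(soundex6("Rupert"))  # R16300
```

Both names hash to the same code, which is the kind of match
sound_similar() is meant to report.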