[Doc-SIG] soundex module status?

Tim Peters tim_one@email.msn.com
Tue, 29 Jun 1999 00:16:28 -0400


Oh, you people are so charmingly naive <wink>.  There are at least a dozen
algorithms *called* "Soundex" out there, and unless you're eager to dig into
the historical archives the original algorithm is mostly likely nowhere to
be found.

I've always been particularly fond of Python's, because it's missing a case
stmt for the most popular letter in the language <e-wink!>.

In any case,

1) Soundex certainly doesn't deserve to be a std C module!  It makes an OK
demo of a Python extension, though.

2) If Skip promises to take it over, I'll attach a Python implementation of
Knuth's version of Soundex.  This isn't the same algorithm as the current
module implements (this one gives better results due to its funky treatment
of w and h and sensible treatment of vowels), but I must protest that
"Knuth's version" isn't entirely well-defined!  This implementation at least
matches the concrete examples he gives.

P362-ly y'rs  - tim

PS:

> Except for lopping off the last two characters, the only difference I
> found was in the mappings for Lukasiewicz and Lissajous.  The Perl
> version yields Z222, while the soundex module yields L200.

The Perl result is "correct" here -- any decent variation of Soundex will
treat voiced vowels as breaking runs of otherwise-similar letters.  This
isn't magic:  it's supposed to given a crude encoding of how a word
*sounds*.  Voiced vowels are confusable so don't deserve their own encoding,
but they certainly break up the consonant sounds on either side.  Speaking
of which, the Knuthian rules pretty much suck for many non-English
languages -- let's make /F write a Unicode version of this <wink>.


Module follows:

NDIGITS = 3

import string
_upper = string.upper
_translate = string.translate

# B for Break -- all characters assumed simply to break runs
_tran = ["B"] * 256
def _setcode(letters, value):
    for ch in letters:
        _tran[ord(_upper(ch))] = _tran[ord(ch)] = value
_setcode("bfpv", "1")
_setcode("cgjkqsxz", "2")
_setcode("dt", "3")
_setcode("l", "4")
_setcode("mn", "5")
_setcode("r", "6")
# B for Break -- these guys break runs
_setcode("aeiouy", "B")
# I for Invisible -- they don't count for anything except as first char
_setcode("hw", "I")
assert len(filter(lambda ch: ch != "B", _tran)) == \
       (26 - len("aeiouy")) * 2, \
       "Soundex initialization screwed up"
_tran = string.join(_tran, "")
del string, _setcode

def soundex(name):
    """name -> Soundex code, following Knuth Vol 3 Ed 2 pg 394."""

    if not name:
        raise ValueError("soundex requires non-empty name argument")
    coded = _translate(name, _tran)
    out = _upper(name[0])
    lastrealcode = coded[0]
    ignore_same = 1
    for thiscode in coded[1:]:
        if thiscode == "B":
            ignore_same = 0
            continue
        if thiscode == "I":
            continue
        if ignore_same and lastrealcode == thiscode:
            continue
        out = out + thiscode
        lastrealcode = thiscode
        ignore_same = 1
        if len(out) > NDIGITS:
            break
    if len(out) < NDIGITS + 1:
        out = out + "0" * (NDIGITS + 1 - len(out))
    return out

def _test():
    global nerrors
    def check(name, expected):
        global nerrors
        got = soundex(name)
        if got != expected:
            nerrors = nerrors + 1
            print "error in Soundex test: name", name, \
                  "expected", expected, "got", got
    nerrors = 0
    check("Euler", "E460")
    check("Ellery", "E460")
    check("guass", "G200")
    check("gauss", "G200")
    check("Ghosh", "G200")
    check("HILBERT", "H416")
    check("Heilbronn", "H416")
    check("Knuth", "K530")
    check("K ** n  U123t9247h   ", "K530")
    check("Kant", "K530")
    check("Lloyd", "L300")
    check("Liddy", "L300")
    check("Lukasiewicz", "L222")
    check("Lissajous", "L222")
    check("Wachs", "W200")
    check("Waugh", "W200")
    check("HYHYH", "H000")
    check("kkkkkkkwwwwkkkkkhhhhhkkkkemmnmnhmn", "K500")
    check("Rogers", "R262")
    check("Rodgers", "R326")
    check("Sinclair", "S524")
    check("St. Clair", "S324")
    check("Tchebysheff", "T212")
    check("Chebyshev", "C121")

    if nerrors:
        raise SystemError("soundex test failed with " + `nerrors` +
                          " errors")

if __name__ == "__main__":
    _test()