soundex (revisited)

Mon Dec 25 12:40:47 EST 2000

Thanks for the critique, Greg.

It looks like I had just the one bug, in the handling of consecutive letters.
I was originally comparing each soundex code with the previous code (rather
than each alpha character with the previous character), but that was not
handling names like 'LLOYD'. I also took your advice on char.isalpha() and
ord('A').
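
For illustration, here is a minimal sketch of a loop that compares soundex
codes rather than characters. The name soundex_sketch and its details are
assumptions made for this example, not the revised code itself. It reuses the
digit table from the quoted post below and, like the version quoted there,
treats H and W as ordinary uncoded letters.

```python
def soundex_sketch(name, digits=3):
    """Hypothetical sketch: collapse runs of equal soundex codes."""
    # Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    codes = "01230120022455012623010202"
    #        ABCDEFGHIJKLMNOPQRSTUVWXYZ

    # Keep only ASCII letters so the table lookup cannot go out of range.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    # Seed with the CODE of the first letter, not the letter itself, so that
    # a doubled first letter (as in 'LLOYD') is collapsed as well.
    last_code = codes[ord(letters[0]) - ord("A")]
    for c in letters[1:]:
        code = codes[ord(c) - ord("A")]
        # Emit a digit only when it is coded and differs from the previous code.
        if code != "0" and code != last_code:
            result += code
        # Update for every letter, so an uncoded letter separates repeats.
        last_code = code

    # Pad with zeros and trim to one letter plus `digits` digits.
    return (result + "0" * digits)[:digits + 1]
```

With this sketch, soundex_sketch('LLOYD') gives 'L300', and 'Robert' and
'Rupert' both give 'R163'.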

Thanks again,
Daniel Klein
Beaverton, Oregon USA

"Greg Jorgensen" <gregj at pobox.com> wrote in message
news:YPE16.190877$U46.5871952 at news1.sttls1.wa.home.com...
> "Daniel Klein" <DanielK at aracnet.com> wrote in message
> news:Var16.263$LU6.109277 at typhoon.aracnet.com...
> > After seeing the post from several days ago on soundex, I gave it a whirl to
> > see if I could come up with something different (and possibly better),
> > following the rules laid down by Knuth:
> >
> > def get_soundex(name, digits = 3):
> >     soundexcodes = "01230120022455012623010202"
> >     #               ABCDEFGHIJKLMNOPQRSTUVWXYZ
> >     instring = name.upper()
> >     soundex = instring[0]
> >     last = soundex
> >     instring = instring[1:]
> >     for char in instring:
> >         if 65 <= ord(char) <= 90:
> >             sx = soundexcodes[ord(char) - 65]
> >             if int(sx) and char != last:
> >                 soundex += sx
> >                 last = char
> >     if len(soundex) < (digits + 1): soundex = (soundex + ("0" * digits))
> >     return soundex[:digits + 1]
>
> I see a few problems, mainly in the handling of consecutive consonants. You
> are checking for consecutive characters, but the Soundex algorithm specifies
> that consecutive character codes be treated as a single code. Both 'mm' and
> 'mn' are considered consecutive codes because both 'm' and 'n' are coded as
> 5.
>
> You can (and probably should) use the isalpha() string method to check for
> alpha characters, rather than the 'magic numbers' 65 through 90. Likewise
> ord(char) - ord('A') is a bit clearer than ord(char) - 65.
>
> Here's a version I wrote. I'm open to any criticisms, suggestions, etc. I
> compared my version to the module announced here a while back (I think mine
> is a lot more readable; it is certainly shorter). I also compared it to a
> Perl version I found and I think my implementation is more robust and
> smaller.
>
> def soundex(name, length=4):
>     """ soundex function conforming to Knuth's algorithm
>         implementation 2000-12-24 by Gregory Jorgensen
>         public domain
>     """
>
>     # digits holds the soundex values for the alphabet
>     digits = '01230120022455012623010202'
>     #         ABCDEFGHIJKLMNOPQRSTUVWXYZ
>     sndx = ''
>     fc = ''
>
>     # translate alpha chars in name to soundex digits
>     for c in name.upper():
>         if c.isalpha():
>             if not fc: fc = c   # remember first letter
>             d = digits[ord(c)-ord('A')]
>             # duplicate consecutive soundex digits are skipped
>             if not sndx or (d != sndx[-1]):
>                 sndx += d
>
>     # replace first digit with first alpha character
>     sndx = fc + sndx[1:]
>
>     # remove all 0s from the soundex code
>     sndx = sndx.replace('0','')
>
>     # return soundex code padded to length characters
>     # (the parameter is named length so it does not shadow the builtin len)
>     return (sndx + (length * '0'))[:length]
>
>
> --
> Greg Jorgensen
> Deschooling Society
> Portland, Oregon, USA
> gregj at pobox.com
>
>