soundex (revisited)
Daniel Klein
DanielK at aracnet.com
Mon Dec 25 12:40:47 EST 2000
Thanks for the critique, Greg.
Looks like I just had the one bug with consecutive letters. I was originally
comparing the next with previous soundex codes (rather than next/previous
alpha character) but it was not handling names like 'LLOYD'. I also took
your advice on char.isalpha() and ord('A').
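
A minimal sketch of how a code-comparing version can still get 'LLOYD' right. It assumes (this is a guess, not something stated above) that the trouble came from keeping the first letter as a letter, so its code was never compared with the second letter's code. The names `code_of` and `soundex_by_code` are made up for illustration.

```python
CODES = "01230120022455012623010202"  # soundex digits for A..Z

def code_of(ch):
    """Return the soundex digit for an uppercase ASCII letter."""
    return CODES[ord(ch) - ord('A')]

def soundex_by_code(name, digits=3):
    """Compare each letter's code with the previous letter's code."""
    letters = [ch for ch in name.upper() if 'A' <= ch <= 'Z']
    if not letters:
        return ''
    result = letters[0]
    prev = code_of(letters[0])  # seed with the first letter's code, not the letter
    for ch in letters[1:]:
        d = code_of(ch)
        if d != '0' and d != prev:
            result += d
        prev = d  # vowels (code 0) reset the comparison
    return (result + '0' * digits)[:digits + 1]

print(soundex_by_code('LLOYD'))
```

Because `prev` starts as the code of 'L', the second 'L' is skipped as a repeat. Like the versions in this thread, the sketch does not give 'H' and 'W' any special treatment.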
Thanks again,
Daniel Klein
Beaverton, Oregon USA
"Greg Jorgensen" <gregj at pobox.com> wrote in message
news:YPE16.190877$U46.5871952 at news1.sttls1.wa.home.com...
> "Daniel Klein" <DanielK at aracnet.com> wrote in message
> news:Var16.263$LU6.109277 at typhoon.aracnet.com...
> > After seeing the post from several days ago on soundex, I gave it a whirl to
> > see if I could come up with something different (and possibly better),
> > following the rules laid down by Knuth:
> >
> > def get_soundex(name, digits = 3):
> >     soundexcodes = "01230120022455012623010202"
> >     #               ABCDEFGHIJKLMNOPQRSTUVWXYZ
> >     instring = name.upper()
> >     soundex = instring[0]
> >     last = soundex
> >     instring = instring[1:]
> >     for char in instring:
> >         if 65 <= ord(char) <= 90:
> >             sx = soundexcodes[ord(char) - 65]
> >             if int(sx) and char != last:
> >                 soundex += sx
> >             last = char
> >     if len(soundex) < (digits + 1): soundex = (soundex + ("0" * digits))
> >     return soundex[:digits + 1]
>
> I see a few problems, mainly in the handling of consecutive consonants. You
> are checking for consecutive characters, but the Soundex algorithm specifies
> that consecutive character codes be treated as a single code. Both 'mm' and
> 'mn' are considered consecutive codes because both 'm' and 'n' are coded as
> 5.
>
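
The point about 'mm' and 'mn' can be sketched in a few lines. This is only an illustration of collapsing runs of equal codes, not a full soundex; the function name `collapse_codes` is made up here.

```python
CODES = "01230120022455012623010202"  # soundex digits for A..Z

def collapse_codes(word):
    """Map letters to soundex digits, keeping one digit per run of equal digits."""
    out = ''
    for ch in word.upper():
        if 'A' <= ch <= 'Z':
            d = CODES[ord(ch) - ord('A')]
            if not out or d != out[-1]:
                out += d
    return out

print(collapse_codes('mm'), collapse_codes('mn'), collapse_codes('mam'))
```

Both 'mm' and 'mn' collapse to a single '5', while 'mam' keeps two 5s because the vowel's 0 sits between them.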
> You can (and probably should) use the isalpha() string method to check for
> alpha characters, rather than the 'magic numbers' 65 through 90. Likewise
> ord(char) - ord('A') is a bit more clear.
>
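
A small sketch of the isalpha() and ord('A') suggestion. It assumes a current Python 3, where isalpha() also accepts non-ASCII letters, so an extra isascii() check (Python 3.7+) keeps the index inside the 26-entry table. The name `letter_code` is made up for illustration.

```python
CODES = "01230120022455012623010202"  # soundex digits for A..Z

def letter_code(ch):
    """Return the soundex digit for a single ASCII letter, or None otherwise."""
    if ch.isascii() and ch.isalpha():
        # ord(...) - ord('A') says "offset from A" without the magic number 65
        return CODES[ord(ch.upper()) - ord('A')]
    return None

print(letter_code('m'), letter_code('N'), letter_code('1'))
```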
> Here's a version I wrote. I'm open to any criticisms, suggestions, etc. I
> compared my version to the module announced here a while back (I think mine
> is a lot more readable; it is certainly shorter). I also compared it to a
> Perl version I found and I think my implementation is more robust and
> smaller.
>
> def soundex(name, len=4):
>     """ soundex module conforming to Knuth's algorithm
>         implementation 2000-12-24 by Gregory Jorgensen
>         public domain
>     """
>
>     # digits holds the soundex values for the alphabet
>     digits = '01230120022455012623010202'
>     sndx = ''
>     fc = ''
>
>     # translate alpha chars in name to soundex digits
>     for c in name.upper():
>         if c.isalpha():
>             if not fc: fc = c   # remember first letter
>             d = digits[ord(c)-ord('A')]
>             # duplicate consecutive soundex digits are skipped
>             if not sndx or (d != sndx[-1]):
>                 sndx += d
>
>     # replace first digit with first alpha character
>     sndx = fc + sndx[1:]
>
>     # remove all 0s from the soundex code
>     sndx = sndx.replace('0','')
>
>     # return soundex code padded to len characters
>     return (sndx + (len * '0'))[:len]
>
>
> --
> Greg Jorgensen
> Deschooling Society
> Portland, Oregon, USA
> gregj at pobox.com
>
>