[Python-Dev] Why is soundex marked obsolete?

Tim Peters tim.one@home.com
Sun, 14 Jan 2001 05:45:53 -0500


[Tim]
>> ...It's a bunch of hacks that appeal not because they solve
>> a problem, but because they're cute algorithms that are pretty
>> easy to implement and kinda solve part of a problem.

[Eric]
> Again, my experience says differently.  I have actually *used*
> Ratcliff-Obershelp to implement Do What I Mean (actually, Tell Me What
> I Mean) -- and had it work very well for non-geek users.  That's why I
> want other Python programmers to have easy access to the capability.
> ...
> Where techniques like Ratcliff-Obershelp really shine (and what I
> expect the module to be used for) is with controlled vocabularies
> such as command interfaces.

Yet the narrower the domain, the less call there is for a library with
multiple approaches.  If R-O really shone for you, why bother with anything
else?  Seriously.  You haven't used some (most?) of these.  The core isn't a
place for research modules either (note that I have no objection whatsoever
to your writing any module you like -- the only question here is what belongs
in the core, and any algorithm *nobody* here has experience with in your
target domain is plainly a poor *core* candidate for that reason alone:  we
have to maintain, justify and explain it for years to come).

> I suspect your speech recognition experience has given you an
> unhelpful bias.

Try to think of it as a helpfully different perspective <0.5 wink>.  It's in
favor of measuring error rate by controlled experiments, skeptical of
intuition, and dismissive of anecdotal evidence.  I might well agree that you
don't need all that heavy machinery, if I had a clear definition of the
problem you're trying to solve (I've learned it's not the kinds of problems
*I* had in mind when I first read your description!).

BTW, telephone speech recognition requires controlled vocabularies because
phone acoustics are too poor for the customary close-talking microphone
approaches to work well enough.  A standard technique there is to build a
"confusability matrix" of the words *in* the vocabulary, to spot trouble
before it happens:  if two words are acoustically confusable, the tool flags
the pair and bounces that info back to the vocabulary designer.  A similar
approach should work well in your domain:  if you get to define the command
interface, run all the words in it pairwise through your similarity measure
of choice, and dream up new words whenever a pair is "too close".  That all
but ensures that even a naive similarity algorithm will perform well (in
telephone speech recognition, the unconstrained error rate runs up to 70% on
cell phones; by constraining the vocabulary with the aid of confusability
measures, we cut that to under 1%).
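
For concreteness, here's a minimal sketch of that pairwise screen.  It
reuses ndiff's SequenceMatcher only because the script at the end of this
msg does; the 0.6 cutoff is an arbitrary stand-in for whatever threshold
fits your vocabulary:

    from ndiff import SequenceMatcher

    def confusable_pairs(words, cutoff=0.6):
        # Flag every pair of vocabulary words whose similarity ratio
        # meets the cutoff, so the designer can rename one of them.
        s = SequenceMatcher()
        trouble = []
        for i in range(len(words)):
            # The expensive indexing is done on seq2, once per outer word.
            s.set_seq2(words[i].lower())
            for j in range(i + 1, len(words)):
                s.set_seq1(words[j].lower())
                if s.ratio() >= cutoff:
                    trouble.append((s.ratio(), words[i], words[j]))
        return trouble

    for ratio, w1, w2 in confusable_pairs(["delete", "deflate",
                                           "commit", "quit"]):
        print w1, "and", w2, "are too close:", round(ratio, 2)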

> ...
> (Actually, my gut after thinking about both algorithms hard is that
> R/O is still a better technique than Levenshtein for the kind of
> application I have in mind.  But I also suspect the difference is
> marginal.)

So drop Levenshtein -- go with your best shot.  Do note that both (usually)
count a single transposition as heavily as two replacements, or as an insert
plus a delete ("pure" Levenshtein prices those two the same).
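
To make that concrete, here's a bare-bones Levenshtein distance (a
from-scratch sketch, not code from any module under discussion); note that
transposing two adjacent characters costs the same 2 as replacing both
outright:

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance:  unit cost for each
        # insert, delete, or replace; no special transposition move.
        n = len(b)
        prev = range(n + 1)
        for i in range(1, len(a) + 1):
            cur = [i]
            for j in range(1, n + 1):
                cost = a[i - 1] != b[j - 1]
                cur.append(min(prev[j] + 1,          # delete from a
                               cur[j - 1] + 1,       # insert into a
                               prev[j - 1] + cost))  # replace (or keep)
            prev = cur
        return prev[n]

    print levenshtein("abcd", "abdc")   # 2: one transposition counts double
    print levenshtein("abcd", "abxy")   # 2: same as two replacements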

What happens when the user doesn't enter an exact match?  Does the kind of
app you have in mind then just present them with a list of choices?  If
that's all (as opposed to, e.g., substituting its best guess for what the
user actually typed and proceeding as if the user had given that from the
start), then the evidence from studies says users are almost as pleased when
the correct choice appears somewhere in the first three choices as when it
appears as *the* top choice.  A well-designed vocabulary can almost
guarantee that happy result (note that most of the current research is aimed
at the much harder job of getting the intended word into the #1 slot on the
choice list).

> (Other good uses for algorithms in this class include cladistics and
> genomic analysis.)

I believe you'll find that current work in those fields has moved far beyond
these simplest algorithms too, although they remain inspirational; for
example, see "Protein Sequence Alignment and Database Scanning" at

    http://barton.ebi.ac.uk/papers/rev93_1/rev93_1.html

Much as in typing, some mutations are more likely than others for *physical*
reasons, so treating all pairs of symbols in the alphabet alike is too gross
a simplification.
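
As a toy illustration, an edit distance can charge less for "physically
likely" substitutions.  The cost table below is invented for the example;
real sequence work uses empirically derived matrices such as PAM or BLOSUM:

    # Invented costs:  pairs listed here are "close" and cheap to swap;
    # any unlisted substitution costs the full 1.
    CLOSE = {("e", "a"): 0.3, ("m", "n"): 0.4}

    def subst_cost(x, y):
        if x == y:
            return 0
        return CLOSE.get((x, y), CLOSE.get((y, x), 1))

    def weighted_distance(a, b, gap=1):
        # Same dynamic program as plain Levenshtein, except the
        # replacement cost now depends on which symbols are involved.
        n = len(b)
        prev = [j * gap for j in range(n + 1)]
        for i in range(1, len(a) + 1):
            cur = [i * gap]
            for j in range(1, n + 1):
                cur.append(min(prev[j] + gap,
                               cur[j - 1] + gap,
                               prev[j - 1] + subst_cost(a[i - 1], b[j - 1])))
            prev = cur
        return prev[n]

    print weighted_distance("grey", "gray")  # 0.3 here, vs 1 unweighted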

>> Even for most developers, it would be better to package up the
>> single best approach you've got (f(list, word) -> list of possible
>> matches sorted in confidence order), instead of a module with 6
>> (or so) functions they don't understand and a pile of equally
>> mysterious knobs.

> That's why good documentation, with motivating usage hints, is
> important.  I write good documentation, Tim.

You're not going to find offense here even if you look for it, Eric <wink>:
while only a small percentage of developers don't read docs at all, everyone
else spaces out at least in linear proportion to the length of the docs.
Most people will be looking for "a solution", not for "a toolkit".  If the
docs read like a toolkit, it doesn't matter how good they are:  the bulk of
the people you're trying to reach will pass on it.  If you really want this
to be *used*, supply one class that does *all* the work, including making
the expert-level choices of which algorithm runs under the covers and how
it's tuned.  That's good advice.
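
To sketch the kind of surface I mean (the names WordMatcher and guess are
hypothetical, and SequenceMatcher stands in for whatever algorithm the
expert picks under the covers):

    from ndiff import SequenceMatcher

    class WordMatcher:
        # Hypothetical facade:  one class, one method.  Which algorithm
        # ranks the candidates, and how it's tuned, never shows.
        def __init__(self, vocabulary):
            self.words = list(vocabulary)
            self.engine = SequenceMatcher()

        def guess(self, word, n=3):
            # Return the n vocabulary words most similar to word, best first.
            self.engine.set_seq2(word.lower())
            scored = []
            for w in self.words:
                self.engine.set_seq1(w.lower())
                scored.append((self.engine.ratio(), w))
            scored.sort()
            scored.reverse()
            return [pair[1] for pair in scored[:n]]

The user writes two lines -- construct it with a vocabulary, call guess() --
and never sees a knob.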

I still expect Guido won't want it in the core before wide use is a
demonstrated fact, though (and no, that's not a chicken-vs-egg thing:  "wide
use" for a thing outside the core is a lower bar than "wide use" for a thing
in the core).  An exception would likely be made if he tried it and liked it
a lot.  But to get it on his radar, it's again much easier if the usage
docs are no longer than a couple of paragraphs.

I'll attach a tiny program that uses ndiff's SequenceMatcher to guess which
of the 147 standard 2.0 top-level library modules a user may be thinking of
(as best I can tell, these are the same results a case-folding R-O
comparison would yield):

Module name? random
Hmm.  My best guesses are random, whrandom, anydbm
(BTW, the first choice was an exact match)
Module name? disect
Hmm.  My best guesses are bisect, dis, UserDict
Module name? password
Hmm.  My best guesses are keyword, getpass, asyncore
Module name? chitchat
Hmm.  My best guesses are whichdb, stat, asynchat
Module name? xml
Hmm.  My best guesses are xmllib, mhlib, xdrlib

[So far so good]

Module name? http
Hmm.  My best guesses are httplib, tty, stat

[I was thinking of httplib, but note that it missed
 SimpleHTTPServer:  a name that long just isn't going to score
 high when the input is that short]

Module name? dictionary
Hmm.  My best guesses are Bastion, ConfigParser, tabnanny

[darn, I *think* I was thinking of UserDict there]

Module name? uuencode
Hmm.  My best guesses are code, codeop, codecs

[Missed uu]

Module name? parse
Hmm.  My best guesses are tzparse, urlparse, pre
Module name? browser
Hmm.  My best guesses are webbrowser, robotparser, user
Module name? brower
Hmm.  My best guesses are webbrowser, repr, reconvert
Module name? Thread
Hmm.  My best guesses are threading, whrandom, sched
Module name? pickle
Hmm.  My best guesses are pickle, profile, tempfile
(BTW, the first choice was an exact match)
Module name? shelf
Hmm.  My best guesses are shelve, shlex, sched
Module name? katmandu
Hmm.  My best guesses are commands, random, anydbm

[I really was thinking of "commands"!]

Module name? temporary
Hmm.  My best guesses are tzparse, tempfile, fpformat

So it gets what I was thinking of into the top 3 very often, despite some
wildly poor guesses at the correct spelling -- you'd *almost* think it was
doing a keyword search, except that the *unintended* choices on the list are
so often insane <wink>.

Something like that may be a nice addition to Paul/Ping's help facility
someday too.

Hard question:  is that "good enough" for what you want?  Checking against
147 things took no perceptible time, because SequenceMatcher is already
optimized for "compare one thing against N":  it does preprocessing work on
the "one thing" that greatly speeds the N similarity computations (I suspect
your code isn't doing that -- yet).  It's been tuned and tested in practice
for years; it works for any sequence type with hashable elements (so Unicode
strings are already covered); and it works for long sequences too.  And if
R-O is the best trick we've got, I believe SequenceMatcher already
implements it.  Do we need more?  Of course *I'm* not convinced we even need
*it* in the core, but packaging a match-1-against-N class is just a few
minutes' editing of what follows.

something-to-play-with-anyway-ly y'rs  - tim


NDIFFPATH = "/Python20/Tools/Scripts"   # where ndiff.py lives
LIBPATH = "/Python20/Lib"               # where the std library .py files live

import sys, os

sys.path.append(NDIFFPATH)
from ndiff import SequenceMatcher

# Map lowercase module stem ("userdict") to real module name ("UserDict").
modules = {}
for f in os.listdir(LIBPATH):
    if f.endswith(".py"):
        f = f[:-3]
        modules[f.lower()] = f

def match(fname, numchoices=3):
    lower = fname.lower()
    s = SequenceMatcher()
    # set_seq2 does the one-time preprocessing; only seq1 changes in the loop.
    s.set_seq2(lower)
    scores = []
    for lowermod, mod in modules.items():
        s.set_seq1(lowermod)
        scores.append((s.ratio(), mod))
    # Sort by similarity ratio, best match first.
    scores.sort()
    scores.reverse()
    # An exact match scores a ratio of 1.0, so it's guaranteed to sort first.
    return modules.has_key(lower), [x[1] for x in scores[:numchoices]]

while 1:
    name = raw_input("Module name? ")
    is_exact, choices = match(name)
    print "Hmm.  My best guesses are", ", ".join(choices)
    if is_exact:
        print "(BTW, the first choice was an exact match)"