[Python-Dev] Why is soundex marked obsolete?

Tim Peters tim.one@home.com
Sat, 13 Jan 2001 16:34:10 -0500


[Eric S. Raymond]
> I have a new goodie for the 2.1 standard library, a module called
> "simil" that supports computation of similarity indices between
> strings such as one might use for recovery-matching of misspellings
> against a dictionary.

My guess is that Guido won't accept it.

> The three methods supported are stemming, normalized Hamming
> similarity, and (the star of the show) Ratcliff-Obershelp gestalt
> subpattern matching.  The latter is spookily effective for detecting
> not just substition typos but insertions and deletions.  The module is
> a C extension (my first!) for speed and because the Ratcliff-Obershelp
> implementation uses pointer arithmetic heavily.

Never heard of R-O, so tracked down some C code via google.  It appears I
invented the same algorithm at Cray Research in the early 80's for a diff
generator, which later got reincarnated in my ndiff.py (in the
Tools/scripts/ directory).  ndiff generates "human-friendly" diffs between
text files, at both the "file is a sequence of lines" and "line is a
sequence of characters" levels.  I didn't have the hyperbolic marketing
genius to call it "gestalt subpattern matching", though <wink> -- I thought
of it as what Unix diff *would* do if it constrained itself to matching
*contiguous* subsequences, and under the theory people would find that more
natural because contiguity is something the human visual system naturally
latches on to.  ndiff can be spookily natural in practice too.

> It's documented, tested, and ready to go.  But having written it, I
> now have a question: why is soundex marked obsolete?  Is there
> something wrong with the algorithm or implementation?

What is the soundex algorithm?  Not joking.  Skip Montanaro and I were
unable to find the algorithm implemented by soundex.c anywhere in the
literature, and I never found *any* two definitions that were the same.
Even Knuth changed his description of Soundex between editions 2 and 3 of
volume 3.  Skip eventually merged my and Fred Drake's Python implementations
of Knuth Vol 3 Ed 3 Soundex (see the Vaults of Parnassus).

> If not, then it would be natural for simil to absorb the existing
> soundex implementation as a fourth entry point.

Well, soundex.c doesn't match any other Soundex on earth, so it's not worth
reproducing in new code.  Guido doesn't want to be in the middle of fighting
over ill-defined algorithms, so booted Soundex entirely.  Another candidate
for inclusion is the NYSIIS algorithm, which is probably in more "serious"
use than Soundex anyway.  Same thing with NYSIIS, though (i.e., what--
exactly --is "the NYSIIS algorithm"?), except that Knuth didn't do us the
favor of making up his own variation that will *become* "the std" via force
of reputation.  Sean True implemented *a* NYSIIS in Python (and again see
the Vaults for a link to that).

So that's why the module is unlikely to make it into the core:

+ There are any number of algorithms people may want to see (I don't know
what "normalized Hamming similarity" means, but if it's not the same as
Levenshtein edit distance then add the latter to the pot too).

+ Each algorithm on its own is likely controversial.

+ Computing string similarity is something few apps need anyway.

Lots of hassle + little demand == not a natural for the core.  ndiff is in
the core only because many people found the *app* useful; its
SequenceMatcher class isn't even advertised.

may-never-understand-how-bigints-got-into-python<wink>-ly
    y'rs  - tim