[Python-Dev] Why is soundex marked obsolete?

M.-A. Lemburg mal@lemburg.com
Mon, 15 Jan 2001 12:56:37 +0100


Tim Peters wrote:
> 
> [M.-A. Lemburg]
> > BTW, are there less English-centric "sounds alike" matchers
> > around?
> 
> Yes, but if anything there are far too many of them:  like Soundex, they're
> just heuristics, and *everybody* who cares adds their own unique twists,
> while proper studies are almost non-existent.  Few variants appear to be in
> use much beyond their inventor's friends; one notable exception in the
> Jewish community is the Daitch-Mokotoff variation, originally tailored to
> their unique needs but later generalized; a brief description here:
> 
>     http://www.avotaynu.com/soundex.html
> 
> The similarly involved NYSIIS algorithm (New York State Identification
> Intelligence System -- look for NYSIIS on Parnassus) was the winner from a
> field of about two dozen competing algorithms, after measuring their
> effectiveness on assorted databases maintained by the state of New York.
> Since New York has a large immigrant population, NYSIIS isn't as
> Anglocentric as Soundex either.

Thanks for the pointer. I'll add that module to my lib :)

       http://metagram.webreply.com/downloads/nysiis.py

Perhaps Eric ought to add this one to his package as well?!
BTW, where can I find your package on the web, Eric? I'd like
to take it for a spin under German language conditions ;)
 
> But state-of-the-art has given up on purely computational algorithms for
> these purposes:  proper names are simply too much a mess.  For example, if I
> search for "Richard", it *ought* to match on "Dick"; if my Arab buddy
> searches on "Mohammed", it *ought* to match on "Mhd"; "the rules" people
> actually use just aren't reducible to pure computation -- it takes a large
> knowledge base to capture what people "just know".  You may enjoy visiting
> this commercial site (AFAIK, nobody is giving away state-of-the-art for
> free):
> 
>     http://www.las-inc.com/

Sad -- "patent pending" algorithms don't help anyone on this
planet :(
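
To make the "knowledge base" point concrete, here is a minimal sketch of
an alias lookup applied before any phonetic comparison. The table and
the function names (ALIASES, canonical, names_match) are hypothetical
and only illustrate the idea; the table holds nothing but the two pairs
mentioned above, whereas a real system would need a far larger,
curated one.

```python
# Hypothetical alias table: canonical name -> known variants.
# Only the two pairs from the thread are included, for illustration.
ALIASES = {
    "richard": {"dick"},
    "mohammed": {"mhd"},
}


def canonical(name):
    """Map a name to its canonical form if it is a known variant."""
    key = name.strip().casefold()
    for canon, variants in ALIASES.items():
        if key == canon or key in variants:
            return canon
    # Unknown names fall through unchanged (after case folding).
    return key


def names_match(a, b):
    """True if both names resolve to the same canonical form."""
    return canonical(a) == canonical(b)


print(names_match("Richard", "Dick"))
print(names_match("Mhd", "Mohammed"))
print(names_match("Richard", "Robert"))
```

No spelling-based rule could derive "Dick" from "Richard"; the match
comes purely from the table, which is exactly the part that cannot be
reduced to computation.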
 
> > ...
> >     http://physics.nist.gov/cuu/Reference/soundex.html
> >
> > works fine for English texts,
> 
> If that were true, the English-speaking researchers would have declared
> victory 120 years ago <wink>.  But English pronunciation is *notoriously*
> difficult to predict from spelling, partly because English is the Perl of
> human languages.

Then Dutch must be the Python of human languages... ;)
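
For reference, the classic Soundex scheme under discussion can be
sketched roughly as follows. This is my own minimal version of the
commonly described American variant (keep the first letter, map the
remaining consonants to digits, collapse adjacent equal codes, pad to
four characters); it is not the code of the stdlib soundex module, and
the function name and the length parameter are assumptions made for
this example.

```python
# Letter groups of the commonly described American Soundex variant.
_CODES = {}
for digit, letters in enumerate(
        ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for letter in letters:
        _CODES[letter] = str(digit)


def soundex(name, length=4):
    """Return a Soundex-style code for name (sketch, ASCII letters only)."""
    # Anything outside A-Z is dropped, including umlauts.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with equal codes
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "0" * length)[:length]


for name in ("Robert", "Rupert", "Tymczak", "Pfister", "Müller"):
    print(name, soundex(name))
```

Note that the sketch simply throws away non-ASCII letters such as "ü",
which is one small illustration of how Anglocentric the scheme is.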

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/