Fuzzy string matching?
John Machin
sjmachin at lexicon.net
Fri Aug 27 19:18:47 EDT 1999
On 27 Aug 99, at 9:52, Fredrik Lundh wrote: [[pruned]]
> Hans Nowak <hnowak at cuci.nl> wrote: [[pruned]]
>> I expect that some of their input will not
> > be recognized in the card database. In such cases, I would like to
> > present the user with a list of card names which match the input
> > closely, but not entirely... strings that 'almost match'. OK, so it's
> > just a fuzzy string match. Is such a thing done in Python already? If
> > not, would it be difficult to do?
>
> the soundex module (or a similar technique, tuned to
> your dictionary) could be what you need:
>
> import soundex
> print soundex.sound_similar("fredrik", "frederick")
> 1
> print soundex.sound_similar("fredrik", "hans")
> 0
Folks,
In my opinion, Soundex is generally far too hit-and-miss to be used in
what is essentially a simple spelling-and-typo-correction application.
The basic problem is that it gives a yes-no answer. The proportion of
false-positives and false-negatives will usually be found to be too
high.
"Standard" soundex uses a four-byte key. I once had to repair the
customer search component of an application --- among several
sillinesses, it was using soundex(surname) as the sole method of
grouping. This gave ludicrous results like:
-- search for "Mousaferiadis" and get one useful hit PLUS bucket-loads
of "McPherson" (false-positives)
-- search for "Moudaferiadis" (single-key typo) and get no hits even
though "Mousaferiadis" was in the database (false-negative)
Here are some illustrative results from the Python-supplied soundex
module, which generates a 6-byte key:
>>> soundex.get_soundex("McPherson")
'M21625'
>>> soundex.get_soundex("Mousaferiadis")
'M21632'
>>> soundex.get_soundex("Moudaferiadis")
'M31632'
>>> soundex.get_soundex("Mousaferiadia")
'M21630'
This module is incorrect in its handling of multiple occurrences of the
same code:
Bug #1 removes instances of the same code when they are separated by
vowels (or W, H, Y).
Bug #2 has the inverse problem when the input starts with a doubled
consonant.
>>> soundex.get_soundex("Mississippi")
'M21000' # Should be M221... (bug #1)
>>> soundex.get_soundex("Lloyd")
'L43000' # Should be L300... (bug #2)
>>> soundex.get_soundex("Loyd")
'L30000' # OK
>>> soundex.get_soundex("Lukasiewicz")
'L20000' # Should be L222... (bug #1)
>>> soundex.get_soundex("Lissajous")
'L20000' # Should be L222... (bug #1)
Reference: Knuth, D.E. "The Art of Computer Programming", vol 3, page
392.
There have been various attempts to produce a better phonetic code e.g.
NYSIIS code, Metaphon, ... but they still give a simplistic yes/no
answer.
In Hans' application, a method which searched the whole vocabulary
(small size in his case) and used a similarity score based on edit
distance (a.k.a. Levenshtein distance) is what I would recommend ---
see posting by Magnus Lie Hetland --- and maybe a follow-up or two :-)
HTH,
John
More information about the Python-list
mailing list