Looking for library to estimate likeness of two strings

John Machin sjmachin at lexicon.net
Thu Feb 7 17:31:48 EST 2008


On Feb 7, 10:37 pm, Matthew_WAR... at bnpparibas.com wrote:
> > On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:
>
> > > Jeff Schwab wrote:
> > ...
> > >> If the strings happen to be the same length, the Levenshtein distance
> > >> is equivalent to the Hamming distance.
>
> Is this really what the OP was asking for? If I understand it correctly,
> Levenshtein distance works out the number of edits required to transform
> one string into the target string. The smaller the distance, the more
> similar the strings, but with the OP's problem I would expect
>
> table1      table2
> brian       briam
>             erian
>
> I think the OP would like to guess at 'briam' rather than 'erian', but
> Levenshtein would rate them as equally good guesses?
>
> I know this is pushing it more toward phonetic analysis of the words or
> something similar, and that's orders of magnitude more complex.
>

Not very. The edit distance idea can be generalised by having variable
penalties for replacement and for insertion/deletion.

E.g. n/m has a low replacement penalty because the two letters are
phonetically very similar AND adjacent on some keyboards.
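
Here's a rough sketch of what I mean -- the "close" pairs and the
penalty values are made up for illustration, not taken from any
particular library:

# Weighted Levenshtein: the substitution penalty depends on which
# pair of characters is involved.
CLOSE_PAIRS = set([frozenset('nm'), frozenset('ck'), frozenset('sz')])

def sub_cost(a, b):
    """Penalty for replacing character a with character b."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CLOSE_PAIRS:
        return 0.4   # cheap: likely typo or mishearing
    return 1.0       # ordinary substitution

def weighted_edit_distance(s, t, indel=1.0):
    """Standard DP edit distance, but with a variable substitution cost."""
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = i * indel
    for j in range(1, len(t) + 1):
        d[0][j] = j * indel
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                             # delete s[i-1]
                d[i][j - 1] + indel,                             # insert t[j-1]
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))  # substitute
    return d[len(s)][len(t)]

print(weighted_edit_distance('brian', 'briam'))  # 0.4
print(weighted_edit_distance('brian', 'erian'))  # 1.0

Plain Levenshtein puts both candidates at distance 1; the cheap n/m
substitution pulls 'briam' ahead of 'erian'.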

Google "zobel editex" for some ideas.

Insertion/deletion: a good tweak is to use a low (even zero) penalty
for omitting a doubled letter, e.g. Matthew / Mathew.
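
A sketch of that tweak bolted onto the same DP (again, the penalty
values are just illustrative): the insert/delete cost drops to almost
nothing when the character being dropped merely repeats its left-hand
neighbour in the same string.

def edit_distance_doubles(s, t, indel=1.0, double_del=0.0):
    def del_cost(u, k):
        # Deleting u[k] is nearly free if it doubles u[k-1].
        return double_del if k > 0 and u[k] == u[k - 1] else indel

    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = d[i - 1][0] + del_cost(s, i - 1)
    for j in range(1, len(t) + 1):
        d[0][j] = d[0][j - 1] + del_cost(t, j - 1)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost(s, i - 1),   # delete s[i-1]
                d[i][j - 1] + del_cost(t, j - 1),   # insert t[j-1]
                d[i - 1][j - 1] + (0.0 if s[i - 1] == t[j - 1] else 1.0))
    return d[len(s)][len(t)]

print(edit_distance_doubles('Matthew', 'Mathew'))   # 0.0
print(edit_distance_doubles('Matthew', 'Mutthew'))  # 1.0

Note that with a zero penalty two distinct strings can come out at
distance 0, so the result is no longer a true metric -- usually not a
problem when you're just ranking candidate matches.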

Google "febrl" for a Python package for record matching -- the authors
have a recent paper where they compare various name-matching methods.

HTH,
John

> This message
[big snip]
has astonishingly large multi-lingual carbuncles on its rump. Please
consider posting from home.

> Ce message et toutes les pieces jointes (ci-apres le

[big snip]


