Jarow-Winkler algorithm: Measuring similarity between strings

Sat Dec 20 06:32:02 EST 2008

On Dec 20, 7:07 pm, Øyvind <oyvin... at gmail.com> wrote:
> Thanks for the useful comments.
>
> On 20 Des, 01:38, John Machin <sjmac... at lexicon.net> wrote:
>
> > On Dec 20, 10:02 am, Øyvind <oyvin... at gmail.com> wrote:
>
> > > Based on examples and formulas fromhttp://en.wikipedia.org/wiki/Jaro-Winkler.
>
> > For another Python implementation, google "febrl".
>
> > > Useful for measuring similarity between two strings. For example if
> > > you want to detect that the user did a typo.
>
> > You mean like comparing the user's input word with some collection of
> > valid words? You would need to be using something else as a quick-and-
> > dirty filter ... Jaro-Winkler is relatively slow.
>
> Do you have any alternative  suggestions?

Use a Levenshtein or Damerau edit distance. The latter handles
adjacent transpositions (...AB...->...BA...) which covers well over
90% of the transposition errors I've ever seen. The remainder
e.g. ...AXB...->...BXA... are given the same weight by Jaro-Winkler;
this is IMHO silly.

Read a little bit further than the CS101 level on Levenshtein edit
distance and ignore any implementation which calculates all N*M cells
of a dynamic programming matrix -- in the case where the objective is
to calculate the distance except that once the distance is known to be
over a threshold you don't care how big it is (you will reject a match
if the distance is too big), then you only need to fill in the entries
in a narrow band astride the trailing diagonal, and you can reliably
give up if the current distance on the diagonal becomes too high.
Ukkonen's the man in this field.

It gets better. You can forget the matrix when the word lengths are
less than the number of bits in your computer word (even 32 bits is
enough for most words in most languages). The bit-parallel techniques
include Damerau distance. This paper by Heikki Hyyrö is well worth
reading, and refers to a whole lot of previous work, including
Ukkonen's:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.2242

And you can forget about all the fancy techniques when the word
lengths are so different that they can't be the same (e.g. Fu and
Featherstonehaugh) -- or the shorter should perhaps be checked to see
if it's a plausible abbreviation or contraction of the longer.

> >  Also as the Wikipedia article says,
> > it's not a metric. I.e. it doesn't satisfy dist(a, c) <= dist(a, b) +
> > dist(b, c).
>
> Its not a mathematical correct metric, but it is a useful heuristical
> metric.

Useful for what? IMHO: If it doesn't satisfy the symmetry condition
and the triangle rule, it's just plain dodgy. J-W is dodgy and slow.
Might as well use soundex. End of story.

>
> > The above code is not symmetrical; jarow_m(s1, s2) does not
> > necessarily equal jarow_m(s2, s1). The article talks about one "m",
> > not two of them.
>
> Hmm.. also a good point. I will make it count all the matches, not
> just the matches in s1.

The whole Jaro definition is woolly ... what is a match when you have
more than one occurrence of the same letter? Have a look at what the
matches might be between MISSISSIPPI and a variant with an extra I in
the middle e.g MISSISISIPPI.

Cheers,
John