Jaro-Winkler algorithm: Measuring similarity between strings

Roger Binns rogerb at rogerbinns.com
Fri Dec 19 20:54:22 EST 2008


Øyvind wrote:
> Based on examples and formulas from http://en.wikipedia.org/wiki/Jaro-Winkler.
> Useful for measuring similarity between two strings. For example if
> you want to detect that the user did a typo.

Jaro-Winkler is at its best when comparing names (Winkler works for the
US Census Bureau).  There are pure-Python and C-accelerated
implementations at
http://bitpim.svn.sourceforge.net/viewvc/bitpim/trunk/bitpim/src/native/strings/
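
For readers who just want to see the shape of the algorithm, here is a
minimal pure-Python sketch written from the formulas on the Wikipedia
page.  It is not the bitpim code; the function names and the defaults
(prefix_scale=0.1, max_prefix=4) are the commonly quoted values and are
assumptions here, not something taken from that repository.

```python
def jaro(s1, s2):
    """Jaro similarity in [0.0, 1.0]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters in order; each out-of-order pair
    # contributes half a transposition.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    transpositions = half_transpositions // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)
```

With the usual textbook pair, jaro_winkler("MARTHA", "MARHTA") comes out
at roughly 0.961, up from a plain Jaro score of roughly 0.944 because of
the shared "MAR" prefix.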


If you are concerned about typos, then taking the keyboard layout into
account will help.  For example, for a user with a US keyboard, hitting
the 'a' or 'd' key would be a common typo for 's'.
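
One rough way to fold that in is to make a substitution between
neighbouring keys cheaper than an arbitrary one.  The sketch below is
only an illustration: QWERTY_ROWS, same_row_neighbors and
substitution_cost are made-up names, it only looks at left/right
neighbours on the letter rows of a US layout (no diagonals, digits or
punctuation), and the 0.5 weight is an arbitrary assumption.

```python
# Letter rows of a US QWERTY keyboard (assumption: letters only).
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def same_row_neighbors(ch):
    """Return the keys immediately left and right of a single letter."""
    ch = ch.lower()
    for row in QWERTY_ROWS:
        pos = row.find(ch)
        if pos != -1:
            # Slicing keeps this safe at either end of the row.
            return set(row[max(0, pos - 1):pos] + row[pos + 1:pos + 2])
    return set()


def substitution_cost(typed, intended, neighbor_cost=0.5):
    """Cost of reading `typed` as `intended`; neighbours are cheaper."""
    if typed.lower() == intended.lower():
        return 0.0
    if typed.lower() in same_row_neighbors(intended):
        return neighbor_cost
    return 1.0
```

A cost function like this could be plugged into an edit-distance
calculation in place of a flat substitution cost of 1.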

Also consider Levenshtein distance:

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
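
For comparison, a minimal sketch of the classic dynamic-programming
version, keeping only two rows of the table at a time.  This is written
for illustration and is not copied from the wikibooks page; the function
name is my own.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]
```

For example, levenshtein("kitten", "sitting") is 3.  Unlike Jaro-Winkler
this is a distance rather than a similarity, so smaller means closer.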

Roger