Comparing 2 similar strings?

Roger Binns rogerb at rogerbinns.com
Thu May 26 01:41:21 EDT 2005


<chrisconnett at gmail.com> wrote in message news:1116487353.934354.195940 at g43g2000cwa.googlegroups.com...
> ***Check out difflib, it's in the library.***  Perfect package for what
> the OP wants AFAICT.

The method in difflib is okay, but doesn't do that good a job.  It
is also relatively slow.  My need for this was matching records in
BitPim (eg which entries just read from a cell phone correspond to
which entries just read from Outlook)

Here are some links for the OP as well as ready to run code at the
end.

A paper on various algorithms and their effectiveness on various data
sets.

http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf

They do have a Java library of all the algorithms by the same authors.

http://secondstring.sourceforge.net/

There is a group at the Australian National University with a project named
Febrl - Freely extensible biomedical record linkage.  The idea is to be able
to work out if two records from the same or different sources are the same
person.  All their code is in Python:

http://datamining.anu.edu.au/software/febrl/febrldoc/

The solution I settled on is using Jaro-Winkler. This is a quote from the
paper above:

  A surprisingly good distance metric is a fast heuristic scheme, proposed
  by Jaro (Jaro 1995; 1989) and later extended by Winkler (Winkler 1999).
  This works almost as well as the Monge-Elkan scheme, but
  is an order of magnitude faster

I did find a Python module that implemented this (it is part of the
python-levenshtein package).  Unfortunately there were some issues
at the time dealing with some strings that may be unicode and some
not.  There was one implementation I saw that took copies of both
strings and was replacing characters with asterisks.  My data sets
include meaningful asterisks so that seriously munged things.

Ultimately I wrote my own implementation with the following properties:

 - The arguments can be any combination of unicode and ordinary strings
 - There are no special characters nor ones used internally as replacements
 - A bitset is used for tracking so you use an eigth of the memory
 - There is both Python and C code implementations so you don't have
   to compile anything but can if you want

The code can be grabbed from:

http://cvs.sf.net/viewcvs.py/bitpim/bitpim/native/strings/

There is a detailed description at the begining of the Python implementation:

http://cvs.sf.net/viewcvs.py/bitpim/bitpim/native/strings/jarowpy.py?view=markup

Roger 





More information about the Python-list mailing list