Fuzzy string comparison

Jorge Godoy jgodoy at gmail.com
Wed Dec 27 05:18:31 EST 2006


"Steve Bergman" <steve at rueb.com> writes:

> I'm looking for a module to do fuzzy comparison of strings.  I have 2
> item master files which are supposed to be identical, but they have
> thousands of records where the item numbers don't match in various
> ways.  One might include a '-' or have leading zeros, or have a single
> character missing, or a zero that is typed as a letter 'O'.  That kind
> of thing.  These tables currently reside in a mysql database.  I was
> wondering if there is a good package to let me compare strings and
> return a value that is a measure of their similarity.  Kind of like
> soundex but for strings that aren't words.

If you were using PostgreSQL, there's a contrib module (pg_trgm) that could
help a lot with that.  It can score how similar two strings are, based on the
trigrams (three-character sequences) they have in common.

You can see how it works in the README
(http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm) and maybe
port it for your needs.
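
As a rough idea of what such a port might look like in Python, here is a
minimal sketch.  It is not pg_trgm's actual code: it assumes the score is the
number of shared trigrams divided by the number of distinct trigrams in either
string, it pads the whole string (two spaces in front, one behind) instead of
splitting it into words first, and the names trigrams() and similarity() are
just illustrative.

```python
def trigrams(text):
    """Return the set of three-character sequences in text.

    Assumption: the string is lowercased and padded as a single unit,
    with two spaces in front and one behind.
    """
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Score two strings from 0.0 (no shared trigrams) to 1.0 (same set)."""
    ta, tb = trigrams(a), trigrams(b)
    # The padding guarantees each set has at least one element,
    # so the union is never empty.
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # A zero typed as the letter 'O' lowers the score but does not zero it.
    print(similarity("10045", "1O045"))
    print(similarity("10045", "10045"))
```

Because the padding adds trigrams for the start and end of the string, a
difference near either end costs more than it would without padding.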

But it probably won't be a single-operation search; you'll have to
post-process the results to decide what to do when there are multiple matches.
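
A sketch of what that post-processing could look like, in Python.  Everything
here is an assumption made for illustration: the cleanup rules in normalize()
are only the ones mentioned in the question (dashes, leading zeros, 'O' typed
for zero), the 0.8 cutoff is arbitrary, and the scoring uses the standard
library's difflib.SequenceMatcher rather than trigrams.

```python
from difflib import SequenceMatcher


def normalize(item):
    """Apply hypothetical cleanup rules before comparing item numbers."""
    # Caution: mapping 'O' to '0' is wrong for item numbers that really
    # contain the letter O; adjust the rules to the actual data.
    cleaned = item.upper().replace("-", "").replace("O", "0")
    return cleaned.lstrip("0")


def rank_candidates(item, candidates, cutoff=0.8):
    """Return (score, candidate) pairs at or above cutoff, best first."""
    key = normalize(item)
    scored = []
    for cand in candidates:
        score = SequenceMatcher(None, key, normalize(cand)).ratio()
        if score >= cutoff:
            scored.append((score, cand))
    # Highest score first; ties are ordered by candidate for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


def classify(item, candidates, cutoff=0.8):
    """Label an item as 'matched', 'ambiguous', or 'unmatched'."""
    ranked = rank_candidates(item, candidates, cutoff)
    if not ranked:
        return ("unmatched", [])
    best = ranked[0][0]
    top = [cand for score, cand in ranked if score == best]
    if len(top) == 1:
        return ("matched", top)
    # Several candidates tie for the best score: leave it for manual review.
    return ("ambiguous", top)


if __name__ == "__main__":
    print(classify("00-1O45", ["1045", "2045X", "9999"]))
    print(classify("1045", ["10-45", "01045"]))
```

Items that come back as "ambiguous" or "unmatched" would then go to a manual
review list instead of being paired automatically.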

-- 
Jorge Godoy      <jgodoy at gmail.com>



More information about the Python-list mailing list