Fuzzy string comparison
Jorge Godoy
jgodoy at gmail.com
Wed Dec 27 05:18:31 EST 2006
"Steve Bergman" <steve at rueb.com> writes:
> I'm looking for a module to do fuzzy comparison of strings. I have 2
> item master files which are supposed to be identical, but they have
> thousands of records where the item numbers don't match in various
> ways. One might include a '-' or have leading zeros, or have a single
> character missing, or a zero that is typed as a letter 'O'. That kind
> of thing. These tables currently reside in a mysql database. I was
> wondering if there is a good package to let me compare strings and
> return a value that is a measure of their similarity. Kind of like
> soundex but for strings that aren't words.
If you were using PostgreSQL, there's a contrib module (pg_trgm) that could
help a lot with that. It can tell you how similar two strings are, based
on a trigram comparison.
You can see how it works in the README
(http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm) and maybe
port it to suit your needs.
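
To give an idea of what such a port might look like, here is a minimal
sketch in Python. It is a simplification based on my reading of the README,
not a faithful reimplementation: I am assuming that strings are lowercased,
split into runs of letters and digits, padded with two leading blanks and one
trailing blank, and that the score is the number of shared trigrams divided
by the number of distinct trigrams in both strings. The function names
(trigrams, similarity) are my own.

```python
import re


def trigrams(text):
    """Return the set of 3-character substrings of text.

    Assumption: each run of letters/digits is treated as a word, lowercased,
    and padded with two leading blanks and one trailing blank.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Return shared trigrams / distinct trigrams, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        # Nothing to compare (empty or punctuation-only strings).
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(similarity("AB-1234", "ab1234"))
```

Note that with this splitting rule a '-' breaks an item number into two
words, so you may prefer to strip punctuation from the item numbers before
comparing them.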
But it probably won't be a single-operation search: you'll have to
post-process the results to decide what to do with multiple matches.
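
The post-processing could go roughly like the sketch below. Everything in it
is an assumption for illustration: the classify function, the 0.7 threshold
and the 0.1 margin are made up, and the score comes from
difflib.SequenceMatcher in the standard library as a stand-in for whatever
similarity measure you end up using (it is not the trigram measure from
pg_trgm).

```python
from difflib import SequenceMatcher


def score(a, b):
    # Stand-in similarity from the standard library, not pg_trgm's measure.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(item, candidates, threshold=0.7, margin=0.1):
    """Return (status, matches) for one item number.

    threshold and margin are hypothetical values; tune them on your data.
    """
    scored = sorted(((score(item, c), c) for c in candidates), reverse=True)
    good = [(s, c) for s, c in scored if s >= threshold]
    if not good:
        return "unmatched", []
    # Two candidates that score almost the same need a human decision.
    if len(good) > 1 and good[0][0] - good[1][0] < margin:
        return "ambiguous", [c for s, c in good]
    return "matched", [good[0][1]]


if __name__ == "__main__":
    print(classify("AB-1234", ["AB1234", "XY9999"]))
```

Items that come back as "ambiguous" or "unmatched" are the ones you would
probably want to review by hand.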
--
Jorge Godoy <jgodoy at gmail.com>