Fuzzy matching of postal addresses
Joseph Turian
turian at gmail.com
Sun Jan 23 23:52:10 EST 2005
Andrew,
> Basically, I have two databases containing lists of postal addresses
> and need to look for matching addresses in the two databases. More
> precisely, for each address in database A I want to find a single
> matching address in database B.
What percentage of addresses in A have a unique corresponding address
in B? (I.e., how many addresses will have some match in B?)
This is a standard document retrieval task. Whole books could be
written about the topic. (In fact, many have been.)
I suggest you don't waste your time trying to solve this problem from
scratch, and instead capitalize on the effort of others. Hence, my
proposal is pretty simple:
1. Regularize the case and punctuation of the text (e.g. convert it all
to uppercase and strip punctuation), since these are uninformative and,
at best, a confounding variable.
2. Use a free information retrieval package to find matches.
e.g. LEMUR: http://www-2.cs.cmu.edu/~lemur/
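Step 1 could look something like the sketch below. The function name
normalize_address is made up for illustration, and the exact rules
(uppercase, turn punctuation into spaces, collapse whitespace) are just
one reasonable assumption; real address data may need more, such as
expanding abbreviations.

```python
import re

def normalize_address(text):
    """Return a case- and punctuation-insensitive form of an address.

    Hypothetical helper: uppercases the text, replaces every character
    that is not a letter, digit, or whitespace with a space, and then
    collapses runs of whitespace into single spaces.
    """
    text = text.upper()
    # \w matches letters, digits, and underscore; treat underscore as
    # punctuation too, so it is replaced explicitly.
    text = re.sub(r"[^\w\s]|_", " ", text)
    return " ".join(text.split())
```

With this, "12 Main St., Apt. #4" and "12 main st apt 4" normalize to
the same string, so they compare equal before any fuzzy matching is
attempted.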
With such a package, a "document" is an address in database B and a
"query" is an address in database A. (Alternatively, you could swap A
and B to see whether that affects accuracy.)
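To make the document/query framing concrete, here is a toy sketch in
plain Python. It is NOT LEMUR and not a substitute for a real retrieval
package; it swaps in a simple TF-IDF weighting with cosine similarity,
which is one common textbook approach. All function names (tokenize,
to_vector, build_index, best_match) and the sample addresses are
invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(address):
    # Uppercase, then keep runs of letters/digits (drops punctuation).
    return re.findall(r"[^\W_]+", address.upper())

def to_vector(tokens, idf):
    # TF-IDF weights for the tokens known to the index, scaled to unit
    # length so that a dot product equals cosine similarity.
    counts = Counter(tokens)
    vec = {tok: n * idf[tok] for tok, n in counts.items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {tok: w / norm for tok, w in vec.items()}

def build_index(documents):
    # Each "document" is one address from database B.
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: rare tokens weigh more.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    vectors = [to_vector(toks, idf) for toks in tokenized]
    return idf, vectors

def best_match(query, idf, vectors):
    # The "query" is one address from database A. Returns the index of
    # the most similar document and its score, or (None, 0.0) if the
    # query shares no tokens with any document.
    q = to_vector(tokenize(query), idf)
    best_index, best_score = None, 0.0
    for i, vec in enumerate(vectors):
        score = sum(w * vec.get(tok, 0.0) for tok, w in q.items())
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

# Made-up sample data, for illustration only.
database_b = [
    "12 Main Street, Springfield",
    "99 Oak Avenue, Shelbyville",
    "12 Oak Street, Springfield",
]
idf, vectors = build_index(database_b)
best_index, score = best_match("12 main st. springfield", idf, vectors)
```

Here best_index comes out as 0, the Main Street entry. In practice you
would also want a score threshold below which an address in A is
reported as having no match in B, and a real package will handle large
databases far more efficiently than this linear scan.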
Good luck.
Joseph