Fuzzy matching of postal addresses

Joseph Turian turian at gmail.com
Sun Jan 23 23:52:10 EST 2005


Andrew,

> Basically, I have two databases containing lists of postal addresses
> and need to look for matching addresses in the two databases. More
> precisely, for each address in database A I want to find a single
> matching address in database B.

What percentage of the addresses in A have a unique corresponding
address in B? (I.e., how many addresses in A will have a match in B at
all?)

This is a standard document retrieval task. Whole books could be
written about the topic. (In fact, many have been.)

I suggest you don't waste your time trying to solve this problem from
scratch, and instead capitalize on the effort of others. Hence, my
proposal is pretty simple:
1. Normalize the case and punctuation of the text (e.g. convert it all
to uppercase and strip the punctuation), since both are uninformative
and, at best, a confounding variable.
2. Use a free information retrieval package to find matches.
e.g. LEMUR: http://www-2.cs.cmu.edu/~lemur/
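
For step 1, here is a minimal sketch of what I mean by normalizing.
The function name normalize_address is just something I made up for
illustration, and the exact rules (uppercase everything, turn
punctuation into spaces, collapse whitespace) are an assumption you
should tune to your data:

```python
import re

def normalize_address(text):
    """Uppercase the text, drop punctuation, and collapse whitespace."""
    text = text.upper()
    # Turn every character that is not a letter, digit, or whitespace
    # into a space. The underscore is handled separately because \w
    # counts it as a word character.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # split() with no arguments also discards leading/trailing whitespace.
    return " ".join(text.split())
```

Replacing punctuation with a space rather than deleting it keeps
tokens like "ST.,APT" from fusing into one word.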

In this case, a "document" is an address in Database B. A "query" is an
address in Database A. (Alternately, you could switch A and B to see if
that affects accuracy.)
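
To make the document/query framing concrete, here is a rough sketch
using only the standard library. It is NOT LEMUR's API, just a toy
TF-IDF ranker with cosine similarity standing in for what a real
retrieval package does far better. All the names (tokens, weigh,
build_index, best_match) are hypothetical, and it assumes database B
is a non-empty list of strings:

```python
import math
import re
from collections import Counter

def tokens(address):
    # Uppercase, then split on anything that is not an ASCII letter or digit.
    return re.findall(r"[A-Z0-9]+", address.upper())

def weigh(toks, idf):
    """Return a unit-length TF-IDF vector as a {token: weight} dict."""
    tf = Counter(toks)
    # Tokens never seen in database B get weight 0; they cannot match anyway.
    vec = {t: count * idf.get(t, 0.0) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_index(documents):
    """Index the addresses of database B (the "documents")."""
    tokenized = [tokens(d) for d in documents]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: count each token once per address
    # Smoothed inverse document frequency: rarer tokens weigh more.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [weigh(toks, idf) for toks in tokenized]
    return idf, vectors

def best_match(query, documents, idf, vectors):
    """Return the address in B most similar to one address from A, with its score."""
    q = weigh(tokens(query), idf)
    # Cosine similarity is a plain dot product because vectors are unit length.
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    best = max(range(len(documents)), key=scores.__getitem__)
    return documents[best], scores[best]
```

You would call build_index once on database B, then best_match once
per address in A. The score lets you set a cutoff below which you
treat an address as having no match in B. The linear scan over B is
fine for a sketch, but a real package uses an inverted index instead.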

Good luck.

   Joseph



