Fuzzy matching of postal addresses

Andrew McLean spam-trap-095 at at-andros.demon.co.uk
Mon Jan 17 19:02:07 EST 2005


I have a problem that I suspect isn't unusual, and I'm looking to see if 
there is any code available to help. I've Googled without success.

Basically, I have two databases containing lists of postal addresses and 
need to look for matching addresses in the two databases. More 
precisely, for each address in database A I want to find a single 
matching address in database B.

I'm 90% of the way there, in the sense that I have a simplistic approach 
that matches 90% of the addresses in database A. But the remaining cases 
could be a pain to deal with!

It's probably not relevant, but I'm using ZODB to store the databases.

The current approach is to loop over the addresses in database A. For 
each one I identify all addresses in database B that share the same 
postal code (typically fewer than 50). The database has a mapping that 
lets me do this efficiently. Then I look for 'good' matches. If there is 
exactly one, I declare a success. This isn't as efficient as it could 
be: it's O(n^2) for each postcode, because I end up comparing all 
possible pairs. But it's fast enough for my application.
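
In outline, the loop looks something like the sketch below. It is a 
simplified illustration, not the real code: it uses plain lists of 
(postcode, address) tuples in place of ZODB, and the names 
build_postcode_index, match_addresses and is_good_match are made up 
for this example.

```python
from collections import defaultdict


def build_postcode_index(addresses):
    """Map each postcode to the list of address strings that carry it."""
    index = defaultdict(list)
    for postcode, text in addresses:
        index[postcode].append(text)
    return index


def match_addresses(db_a, db_b, is_good_match):
    """For each address in A, return its single good match in B, or None.

    db_a and db_b are iterables of (postcode, address) tuples.
    is_good_match is a caller-supplied predicate on two address strings.
    """
    index_b = build_postcode_index(db_b)
    results = {}
    for postcode, text_a in db_a:
        # Only compare against addresses in B sharing the same postcode.
        candidates = [
            text_b
            for text_b in index_b.get(postcode, [])
            if is_good_match(text_a, text_b)
        ]
        # Declare success only when exactly one candidate qualifies.
        results[(postcode, text_a)] = (
            candidates[0] if len(candidates) == 1 else None
        )
    return results
```

Zero candidates and several candidates both end up as None here, so 
ambiguous cases are never silently resolved.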

The problem is looking for good matches. I currently normalise the 
addresses to ignore some irrelevant issues like case and punctuation, 
but there are other issues.
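
To give an idea of the sort of normalisation I mean, here is a minimal 
sketch. The function name normalise and the exact rules (drop 
apostrophes, turn other punctuation into spaces, collapse whitespace) 
are assumptions for illustration rather than a copy of what I run.

```python
import re


def normalise(address):
    """Lower-case an address and strip punctuation and extra whitespace."""
    text = address.lower()
    # Drop apostrophes so "John's" becomes "johns", not "john s".
    text = text.replace("'", "")
    # Replace any other punctuation with a space so "a,b" splits cleanly.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

This is enough to make "Penthouse,Old Vicarage" and "PENTHOUSE OLD 
VICARAGE" compare equal, but it does nothing about a missing "THE" or 
"White Sheet" versus "WHITESHEET".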

Here are just some examples where the software didn't declare a match:

1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

The challenge is to fix some of the false negatives above without 
introducing false positives!

Any pointers gratefully received.

-- 
Andrew McLean


