Fuzzy matching of postal addresses

Andrew McLean spam-trap-095 at at-andros.demon.co.uk
Tue Jan 18 16:26:40 EST 2005


Thanks for all the suggestions. There were some really useful pointers.

A few random points:

1. Spending money is not an option; this is a 'volunteer' project. I'll 
try out some of the ideas over the weekend.

2. Someone commented that the data was suspiciously good quality. The 
data sources are both ones that you might expect to be authoritative. If 
you take having a correctly formatted and valid postcode as the metric, 
100% of the records in one database pass, and 99.96% in the other.

3. I've already noticed duplicate addresses in one of the databases.

4. You need to be careful doing an endswith search. It was actually my 
first approach to the house name issue. The problem is you end up 
matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".
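
To make the pitfall in point 4 concrete, here is a minimal sketch. The 
function names and the "Rose Cottage" house name are made up for 
illustration, and the boundary check is only one possible fix, not 
necessarily the one to use in practice.

```python
def naive_match(full, short):
    """Plain suffix test: true whenever `full` ends with `short`."""
    return full.endswith(short)


def boundary_match(full, short):
    """Suffix test that also requires the match to start on a token
    boundary, so a house number cannot match the tail of a longer one.
    (Hypothetical helper, for illustration only.)"""
    if not full.endswith(short):
        return False
    prefix = full[:len(full) - len(short)]
    # Accept only if nothing precedes the match, or the preceding
    # character is a separator (space, comma, ...) rather than a
    # letter or digit.
    return prefix == "" or not prefix[-1].isalnum()


# The false positive described above: "12 ..." ends with "2 ...".
print(naive_match("12 Acacia Avenue", "2 Acacia Avenue"))     # True
print(boundary_match("12 Acacia Avenue", "2 Acacia Avenue"))  # False

# The house-name case still matches (made-up house name).
print(boundary_match("Rose Cottage, 2 Acacia Avenue",
                     "2 Acacia Avenue"))                      # True
```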

I am tempted to try an approach based on splitting each address into a 
sequence of normalised tokens, then working with a metric based on the 
differences between the two sequences. The simple case would look at 
deleting tokens and perhaps concatenating adjacent tokens to make a 
match.
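
A rough sketch of what such a token-based metric could look like. 
Everything here is an assumption for illustration: the normalisation 
rule (lower-case, keep runs of letters and digits), the function names, 
and the unit costs for deleting a token or joining two adjacent tokens. 
It is a small dynamic-programming edit distance over tokens, not a 
tested matching scheme.

```python
import re


def tokens(address):
    """Normalise an address into a list of lower-case alphanumeric
    tokens (assumed normalisation: punctuation and spacing dropped)."""
    return re.findall(r"[a-z0-9]+", address.lower())


def token_distance(a, b, delete_cost=1, join_cost=1):
    """Cheapest way to make token lists `a` and `b` agree, where a
    token may be deleted from either list, or two adjacent tokens in
    one list may be concatenated to match one token in the other.
    The cost values are arbitrary placeholders."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * delete_cost
    for j in range(1, cols):
        d[0][j] = j * delete_cost
    for i in range(1, rows):
        for j in range(1, cols):
            # Delete the last token of a, or the last token of b.
            best = min(d[i - 1][j], d[i][j - 1]) + delete_cost
            # Last tokens are identical: no cost.
            if a[i - 1] == b[j - 1]:
                best = min(best, d[i - 1][j - 1])
            # Two adjacent tokens of a join to form the last token of b.
            if i >= 2 and a[i - 2] + a[i - 1] == b[j - 1]:
                best = min(best, d[i - 2][j - 1] + join_cost)
            # Two adjacent tokens of b join to form the last token of a.
            if j >= 2 and b[j - 2] + b[j - 1] == a[i - 1]:
                best = min(best, d[i - 1][j - 2] + join_cost)
            d[i][j] = best
    return d[-1][-1]


house = tokens("Rose Cottage, 2 Acacia Avenue")   # made-up house name
plain = tokens("2 Acacia Avenue")
print(token_distance(house, plain))                        # 2
print(token_distance(tokens("Rosecottage, 2 Acacia Avenue"), house))  # 1
print(token_distance(tokens("12 Acacia Avenue"), plain))   # 2
```

Because "12" and "2" are separate tokens, they are never treated as 
equal, which avoids the endswith problem. With uniform costs, though, 
that pair scores the same as the house-name case, so a usable metric 
would probably need to weight numeric tokens differently from words.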

-- 
Andrew McLean


