Fuzzy matching of postal addresses
Skip Montanaro
skip at pobox.com
Mon Jan 17 22:11:36 EST 2005
Andrew> I'm 90% of the way there, in the sense that I have a simplistic
Andrew> approach that matches 90% of the addresses in database A. But
Andrew> the extra cases could be a pain to deal with!
Based on the examples you gave, here are a couple of things you might try to
reduce the number of difficult comparisons:
* Remove "the" and commas as part of your normalization process

* Split each address on white space and convert the resulting list to a
  set, then consider the size of the intersection with other addresses
  with the same postal code:

>>> a1 = "St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL".upper().replace(",", "")
>>> a1
"ST JOHN'S PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL"
>>> a2 = "THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL".upper().replace(",", "").replace("THE ", "")
>>> a2
'PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL'
>>> a1 == a2
False
>>> sa1 = set(a1.split())
>>> sa2 = set(a2.split())
>>> len(sa1)
8
>>> len(sa2)
6
>>> len(sa1 & sa2)
6
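
Here every word of the shorter address turns up in the longer one, which
is a fairly strong hint that the two refer to the same place. If you want
to run that over a whole batch, something like the following rough sketch
might do. The function names (normalize, overlap, best_match) are just
made up for illustration, and it assumes you have already narrowed the
candidates down to those sharing a postal code. Note that it drops "THE"
as a whole word rather than with str.replace, so a name like "LATHE" is
left alone, and it turns commas into spaces so "DORSET,DT8" still splits.

```python
def normalize(address):
    """Upper-case, treat commas as spaces, and drop the standalone word THE."""
    tokens = address.upper().replace(",", " ").split()
    return [t for t in tokens if t != "THE"]


def overlap(addr_a, addr_b):
    """Fraction of the shorter address's words that also occur in the other."""
    sa, sb = set(normalize(addr_a)), set(normalize(addr_b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def best_match(address, candidates):
    """Return (score, candidate) for the best-overlapping candidate, or None.

    Assumes the candidates were already filtered to the same postal code.
    """
    scored = [(overlap(address, c), c) for c in candidates]
    return max(scored, default=None)
```

With the two addresses above, overlap() gives 1.0. Where you draw the line
for a "match" is something you'd have to tune against your own data.
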
Skip