[Chicago] Address parser?

Phil Robare verisimilidude at gmail.com
Mon Feb 25 18:02:58 CET 2008


Address parsing is a hard problem.  Not in the theoretical NP sense,
but in that it requires a lot of knowledge of special cases.
Addresses can be ambiguous or not depending upon information that the
application 'just has to know'.  For instance, an address in Chicago of
320 Randolph is ambiguous: it could be east or west.  But an address
of 1320 Randolph is merely incomplete: the block number tells you it
needs West as part of the street name.  If the user dropped the space
you could still figure out where 1320 westrandolph street was.  But you
can't split off every leading "west" that way: a Westmont Street would
just be a street named after a suburb.  An address there would probably
be the same as an address on Westmont Ave.  But Atlanta is (in)famous for having
multiple different roads all named Peachtree but having different
suffixes, e.g. Road, Avenue, Boulevard.  Usually digits are part of
the house number and words are part of the street name.  Detroit, for
example, confounds things with "8 Mile Road", where the digit belongs
to the street name.  In many places a street
has multiple names, bearing both the local name and the highway route
name, so you get an address like 185 Rt 45.  There are people with the
last name of "Street" that have had a road named after them.  While
the block number might be useful for figuring out west or east in
Chicago, in the suburbs it can be a mess.  Arlington Heights Road goes
through a number of suburbs, many of them having their own numbering
system and their own east/west dividing point.  These addresses can be
ambiguous because no one knows which suburb they are in as they drive
along it. Most addresses are whole numbers but within the US there are
a number of places that use fractions (like 1/2) to specify part of a
duplex, and there are even places that use decimals in the address in
place of apartment numbers.  Another problem is that there are
multiple towns with the same name in some states, so the county has to
be part of the address (or the zip code has to be checked).
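
To make a few of these concrete, here is a rough Python sketch of the
naive "digits are the house number, words are the street name" rule.
The function name, the direction table and the suffix list are all made
up for illustration; this is not how any real parser works, just a
minimal version of the obvious first attempt.

```python
import re

# Hypothetical lookup tables, far from complete.
DIRECTIONS = {"n": "North", "s": "South", "e": "East", "w": "West",
              "north": "North", "south": "South",
              "east": "East", "west": "West"}
SUFFIXES = {"st", "street", "ave", "avenue", "rd", "road",
            "blvd", "boulevard"}


def naive_parse(address):
    """Split an address with the 'leading digits are the number' rule."""
    tokens = address.replace(".", "").split()
    result = {"number": None, "direction": None,
              "name": None, "suffix": None}
    # A leading all-digit token is taken to be the house number.
    if tokens and re.fullmatch(r"\d+", tokens[0]):
        result["number"] = tokens.pop(0)
    # A separate direction word right after the number is the prefix.
    if tokens and tokens[0].lower() in DIRECTIONS:
        result["direction"] = DIRECTIONS[tokens.pop(0).lower()]
    # A trailing street type is the suffix (keep at least one name word).
    if len(tokens) > 1 and tokens[-1].lower() in SUFFIXES:
        result["suffix"] = tokens.pop()
    result["name"] = " ".join(tokens) or None
    return result


print(naive_parse("1320 W Randolph St"))  # the easy case
print(naive_parse("320 Randolph"))        # direction stays None
print(naive_parse("8 Mile Road"))         # 8 is taken as the number
```

The first call comes out fine.  The second leaves the direction empty,
and nothing in the string can fill it in.  The third treats 8 as a house
number on a street called "Mile", which is exactly the Detroit problem.
Fixing any of these means adding local knowledge, not more clever
string handling.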

So, as far as I know, there are no good public domain address parsers
because of the amount of work it takes to create one and the
dependence of the parsing upon an underlying map.  If you are a direct
marketer mailing hundreds of pieces the post office parser may be a
good choice.  But if you are working for a web retailer who would just
like to make sure the user typed an address that can be mailed to, I
think the Google API would be an option (depending upon the terms of
use; I don't know how restrictive they are with regard to businesses
using it).  Navteq and Tele Atlas have commercial offerings that I am
not very familiar with.

Asking the person entering the data to fill in separate fields for
house number, street name, street type, direction suffix/prefix, etc.
can make the job of the coder easier but will frustrate those who have
to enter an address that doesn't fit the model.
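
As a rough illustration of that trade-off, here is a small Python
sketch of a fielded form and the checks a coder might be tempted to
write for it.  The class, the field names and the list of street types
are hypothetical, not taken from any real form library.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical choices for a street-type dropdown.
STREET_TYPES = {"St", "Ave", "Rd", "Blvd"}


@dataclass
class AddressForm:
    house_number: str
    street_name: str
    street_type: str
    direction: str = ""  # prefix or suffix, optional


def form_problems(form: AddressForm) -> List[str]:
    """Return the complaints a rigid form would show the user."""
    problems = []
    if not form.house_number.isdigit():
        problems.append("house number must be a whole number")
    if not form.street_name.strip():
        problems.append("street name is required")
    if form.street_type not in STREET_TYPES:
        problems.append("street type must be one of the listed choices")
    return problems


print(form_problems(AddressForm("1320", "Randolph", "St", "W")))
print(form_problems(AddressForm("123 1/2", "Main", "St")))
print(form_problems(AddressForm("185", "Rt 45", "")))
```

The first address passes.  The duplex with a 1/2 in its number and the
route-number address are both rejected even though they are real kinds
of addresses, and those are the users who get frustrated.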

Phil

