newbie question: parsing street name from address

Fri Jun 22 17:55:37 EDT 2007

On Jun 23, 1:43 am, Eric <ven... at gmail.com> wrote:
> On Jun 21, 6:03 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > On Jun 22, 4:43 am, Eric <ven... at gmail.com> wrote:
>
> > > On Jun 21, 9:47 am, cjl <cjl... at gmail.com> wrote:
>
> > > > P:
>
> > > > I am working on a project that requires geocoding, and have written a
> > > > very simple geocoder that uses the Google service.
>
> > > > I would like to be able to extract the name of the street from the
> > > > addresses in my data; however, they vary significantly. Here are some
> > > > examples:
>
> > > > 25 Main St
> > > > 2500 14th St
> > > > 12 Bennet Pkwy
> > > > Pearl St
> > > > Bennet Rd and Main st
> > > > 19th St
>
> > > > As you can see, sometimes I have the house number, and sometimes I do
> > > > not. Sometimes the street name is a number. Sometimes I simply have
> > > > the names of intersecting streets.
>
> > > > I would like to be able to parse the above into the following:
>
> > > > Main St
> > > > 14th St
> > > > Bennet Pkwy
> > > > Pearl St
> > > > Bennet Rd
> > > > Main St
> > > > 19th St
>
> > > > How might I approach this complex parsing problem?
>
> > > > -CJL
>
> > > You might be able to use consistencies in your data to make this
> > > simpler.  If the examples you have there are representative, it looks
> > > like what you should do is look for a word like 'St' or 'Rd' and then
> > > return that word and the previous word.
>
> > The OP's data already contains
> >     [corner|cnr [of]] Foo Rd and|& Bar St
> > and real world data will contain things like
> >     1234 John F Kennedy Memorial Drive
> >     456 Broadway
>
> > As Paul wrote, "Parsing street addresses is a very complex parsing
> > problem", even when you restrict yourself to one mostly-English-
> > speaking country. Software written under such restrictions rapidly
> > breaks down elsewhere (Rue de la Paix, Wilhelmstrasse, Avenida 9 de
> > Julio, etc) and blows up altogether when street names aren't used in
> > postal addresses (e.g. Japan).
>
> No doubt address parsing is, in general, a very difficult
> problem.  However, it may not be necessary for him to solve the
> general problem.  If his dataset is more limited in formats, then his
> problem is much simpler.

Ignore the last sentence of my post, and restrict the application to
[sub]urban addresses in the USA. Even then, if the OP's dataset is
real-world data, it will contain street addresses that don't fit "look
for a word like 'St' or 'Rd' and then return that word and the
previous word" (the two examples I gave above, for a start).  To expect
an OP in a newsgroup to provide representative examples is charmingly
naive :-) and in any case the OP had already provided a corner case
[pun intended] that busted your rule.
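
For illustration only, here is a rough sketch of the quoted rule taken
literally. It is not code from anyone in this thread: the SUFFIXES set
and the name naive_street_name are made up for the example, and a real
suffix list would be much longer.

```python
# Hypothetical, deliberately short list of street-type words (lower case).
SUFFIXES = {"st", "rd", "pkwy", "ave", "dr", "drive"}


def naive_street_name(address):
    """Return the first street-type word plus the word before it, or None."""
    words = address.split()
    # Start at 1 so that there is always a previous word to return.
    for i in range(1, len(words)):
        if words[i].lower().rstrip(".") in SUFFIXES:
            return " ".join(words[i - 1:i + 1])
    return None


if __name__ == "__main__":
    samples = [
        "25 Main St",
        "2500 14th St",
        "Pearl St",
        "Bennet Rd and Main st",
        "1234 John F Kennedy Memorial Drive",
        "456 Broadway",
    ]
    for address in samples:
        print(repr(address), "->", repr(naive_street_name(address)))
```

The first three samples come out as wanted. The intersection yields
only 'Bennet Rd' and drops the second street, the Kennedy address
comes back as 'Memorial Drive', and '456 Broadway' gives None because
it has no street-type word at all.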

Cheers,
John