newbie question: parsing street name from address

Eric venner at gmail.com
Fri Jun 22 11:43:46 EDT 2007


On Jun 21, 6:03 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jun 22, 4:43 am, Eric <ven... at gmail.com> wrote:
>
>
>
> > On Jun 21, 9:47 am, cjl <cjl... at gmail.com> wrote:
>
> > > P:
>
> > > I am working on a project that requires geocoding, and have written a
> > > very simple geocoder that uses the Google service.
>
> > > I would like to be able to extract the name of the street from the
> > > addresses in my data, however they vary significantly. Here are some
> > > examples:
>
> > > 25 Main St
> > > 2500 14th St
> > > 12 Bennet Pkwy
> > > Pearl St
> > > Bennet Rd and Main st
> > > 19th St
>
> > > As you can see, sometimes I have the house number, and sometimes I do
> > > not. Sometimes the street name is a number. Sometimes I simply have
> > > the names of intersecting streets.
>
> > > I would like to be able to parse the above into the following:
>
> > > Main St
> > > 14th St
> > > Bennet Pkwy
> > > Pearl St
> > > Bennet Rd
> > > Main St
> > > 19th St
>
> > > How might I approach this complex parsing problem?
>
> > > -CJL
>
> > You might be able to use consistencies in your data to make this
> > simpler.  If the examples you have there are representative, it looks
> > like what you should do is look for a word like 'St' or 'Rd' and then
> > return that word and the previous word.
>
> The OP's data already contains
>     [corner|cnr [of]] Foo Rd and|& Bar St
> and real world data will contain things like
>     1234 John F Kennedy Memorial Drive
>     456 Broadway
>
> As Paul wrote, "Parsing street addresses is a very complex parsing
> problem", even when you restrict yourself to one mostly-English-
> speaking country. Software written under such restrictions rapidly
> breaks down elsewhere (Rue de la Paix, Wilhelmstrasse, Avenida 9 de
> Julio, etc) and blows up altogether when street names aren't used in
> postal addresses (e.g. Japan).

No doubt address parsing is, in general, a very difficult problem.
However, he may not need to solve the general problem.  If his dataset
only contains a limited set of formats, then his problem is much
simpler.
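
For what it's worth, here is a rough sketch of the "find the suffix word
and take the word before it" idea from my earlier reply.  The SUFFIXES
set and the function name street_names are my own inventions for
illustration, and the suffix list is only a guess at what his data
contains, so it would need extending.  Case is normalized on the suffix
only ("st" becomes "St").

```python
# Assumed, non-exhaustive set of street-type suffixes (lowercase).
# Extend this to match whatever actually appears in the data.
SUFFIXES = {"st", "rd", "pkwy", "ave", "blvd", "ln"}

def street_names(address):
    """Return every 'Name Suffix' pair found in the address string.

    Heuristic only: a street is taken to be a known suffix word
    plus the single word immediately before it.
    """
    words = address.split()
    found = []
    for i, word in enumerate(words):
        suffix = word.rstrip(".,")
        # Need a preceding word to pair with the suffix.
        if i > 0 and suffix.lower() in SUFFIXES:
            found.append(words[i - 1] + " " + suffix.capitalize())
    return found

if __name__ == "__main__":
    for line in ["25 Main St", "2500 14th St", "12 Bennet Pkwy",
                 "Pearl St", "Bennet Rd and Main st", "19th St"]:
        print(line, "->", street_names(line))
```

This handles the six sample addresses from the original post, but it
does nothing useful with the cases John listed: "456 Broadway" has no
suffix and comes back empty, and a multi-word name like "John F Kennedy
Memorial Drive" would at best be cut down to its last word plus the
suffix.  So it is only worth using if the data really is that regular.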



