matching a street address with regular expressions

John Machin sjmachin at lexicon.net
Fri Oct 12 19:50:52 EDT 2007


On Oct 13, 4:32 am, Paul McGuire <pt... at austin.rr.com> wrote:
> On Oct 12, 8:19 am, John Machin <sjmac... at lexicon.net> wrote:
>
> > "... most of the developed world" was the [very optimistic] request.
> > How does it go with "JAPAN 112-0001 TOKYO Bunkyo-Ku Hakusan 4-Chome 3-
> > 2" and will it give the same result for "4-3-2 HAKUSAN BUNKYO-KU TOKYO
> > 112-00001 JAPAN"? OK, a little exotic ... closer to "home", what about
> > addresses in Quebec? People often write addresses in formats that you
> > won't find on the postal service website, but the local postal workers
> > will still deliver. Rural addresses can be quaintly medieval e.g. "Lot
> > 123, Hundred of Foughbarre" [South Australia]. Etc etc ...
>
> John -
>
> As I'm sure you already know, the answer is "not too well".

Yup.

> In the spectrum of tools from string.split, to re, to pyparsing and
> its ilk, to natural language toolkits, the broader the scope of your
> "address space", the further you find your self moving toward NLTK.

Yup. Each different postal jurisdiction has its own official language
for addresses, and for each official language there are regional etc
dialects and slang and ...

> I
> knew my pyparsing script wasn't in the "anywhere in the developed
> world" category, that's why I posted the test cases, too.

Test cases which presume the OP's assumption that the street address
is available separately (or that the locality, state, and zipcode have
miraculously been parsed out already) and that the country is the USA.

In real world databases that I've come across, data flows across field
boundaries so freely that one just has to give up on individual fields
and work on something like ' '.join([street1, street2, locality,
state, zipcode, country]). With luck, the country will have been
filled in with a (fuzzily-)recognisable country name/code/
abbreviation ... otherwise it's off to a country_guessing module.
Quick quiz question: which 3 postal jurisdictions have a state/
province/region abbreviated to "NT"?

You could even be cleaning up after a butcher who thought he was
cleaning up, creating locality/country combinations like this:

CHADSTONE -> STONE/CHAD
COROMANDEL VALLEY -> CORDEL VALLEY/OMAN
PRINCE OF WALES HOSPITAL -> PRINCE OF  HOSPITAL/WALES

>
> If you've got an re that can handle everything from "123 Main" to
> "221B Baker Street" to "Hollywood and Vine" to "Lot 123, Hundred of
> Foughbarre", now THAT would be something.

Of course "an re" can't possibly handle everything for postal
addresses.  See Friedl's attempt to provide an re to match all
possible e-mail addresses -- a somewhat more restricted domain than
postal addresses.




More information about the Python-list mailing list