Usable street address parser in Python?

Tue Apr 20 05:23:15 EDT 2010

On Apr 20, 8:24 am, John Yeung <gallium.arsen... at gmail.com> wrote:
> My response is similar to John Roth's.  It's mainly just sympathy. ;)
>
> I deal with addresses a lot, and I know that a really good parser is
> both rare/expensive to find and difficult to write yourself.  We have
> commercial, USPS-certified products where I work, and even with those
> I've written a good deal of pre-processing and post-processing code,
> consisting almost entirely of very silly-looking fixes for special
> cases.
>
> I don't have any experience whatsoever with pyparsing, but I will say
> I agree that you should try to get the street type from the end of the
> line.  Just be aware that it can be valid to leave off the street type
> completely.  And of course it's a plus if you can handle suites that
> are on the same line as the street (which is where the USPS prefers
> them to be).
>
> I would take the approach which John R. seems to be suggesting, which
> is to tokenize and then write a whole bunch of very hairy, special-
> case-laden logic. ;)  I'm almost positive this is what all the
> commercial packages are doing, and I have a tough time imagining what
> else you could do.  Addresses inherently have a high degree of
> irregularity.
>
> Good luck!
>
> John Y.
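
For what it's worth, the tokenize-then-special-case approach might start
out something like the rough sketch below. The lookup tables and the
function name split_street_line are made up for illustration and are
nowhere near complete. It peels a trailing suite/unit off first, then
takes the street type from the end of the line, and tolerates a missing
street type.

```python
import re

# Hypothetical, deliberately tiny lookup tables; real ones are far longer.
STREET_TYPES = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD",
                "BLVD", "DR", "DRIVE", "LN", "LANE"}
UNIT_WORDS = {"STE", "SUITE", "APT", "UNIT", "#"}

def split_street_line(line):
    """Split one street line into number, name, type and unit (any may be None)."""
    # Tokenize: keep "#" as its own token, otherwise runs of letters/digits.
    tokens = re.findall(r"#|[A-Za-z0-9]+", line.upper())
    result = {"number": None, "name": None, "type": None, "unit": None}

    # Peel off a trailing suite/unit, scanning from the right.
    for i in range(len(tokens) - 1, 0, -1):
        if tokens[i] in UNIT_WORDS:
            result["unit"] = " ".join(tokens[i:])
            tokens = tokens[:i]
            break

    # A leading token that starts with a digit is taken as the house number.
    if tokens and tokens[0][0].isdigit():
        result["number"] = tokens.pop(0)

    # Street type comes from the end of what is left; it may be absent.
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        result["type"] = tokens.pop()

    result["name"] = " ".join(tokens) or None
    return result

print(split_street_line("123 Main St Ste 4"))
print(split_street_line("100 Broadway"))
```

Everything past this point is where the silly-looking special cases
would pile up (directionals, "Avenue of the Americas", street names that
are themselves unit words, and so on).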

I'm not sure what volume of addresses you're working with, but as an
alternative you could try extracting the ZIP code, looking up all known
addresses in that ZIP code (assuming you have such a list to look them
up in), and then picking whichever of those address strings most closely
resembles your address string (smallest Levenshtein distance?).
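
Something along these lines, as a rough sketch only. The names
closest_address, normalize and addresses_by_zip are made up, and
addresses_by_zip is assumed to be a dict you have already built from
whatever reference data you have, mapping a 5-digit ZIP string to a list
of street lines. It also assumes the ZIP is the last thing in the input
string and that everything before it is the street line.

```python
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalize(text):
    """Uppercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^A-Z0-9 ]", " ", text.upper()).split())

def closest_address(raw, addresses_by_zip):
    """Return the known address in raw's ZIP code nearest to raw, or None."""
    # Assumption: a 5-digit ZIP (optionally ZIP+4) ends the string.
    match = re.search(r"\b(\d{5})(?:-\d{4})?\s*$", raw)
    if not match:
        return None
    candidates = addresses_by_zip.get(match.group(1), [])
    if not candidates:
        return None
    street = normalize(raw[:match.start()])
    return min(candidates, key=lambda c: levenshtein(street, normalize(c)))

known = {"12345": ["123 MAIN ST", "125 MAPLE AVE", "9 ELM RD"]}
print(closest_address("123 Main Street, 12345", known))
```

The pure-Python distance function is quadratic in string length, so for
a large volume you would probably want a C-backed implementation and a
cutoff above which you treat the result as "no match".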

Iain