Regular Expression Help Needed (Starting Point)

Brad Bollenbach bbollenbach at home.com
Sat Nov 17 13:02:51 EST 2001


Here's some code to get you started:

import htmllib
import formatter

class EmailParser(htmllib.HTMLParser):
    "I know how to rip the email addresses out of an HTML document."
    def __init__(self, formatter):
        htmllib.HTMLParser.__init__(self, formatter)
        self.__email_addresses = []

    def anchor_bgn(self, href, name, type):
        # This method gets called whenever an anchor is found in the HTML
        # document. In here, add a regex to check that the href starts with
        # "mailto:", then extract the text following that. Add an email
        # address verification function (such as the one found in the Python
        # Cookbook) to check that text, and if all is good, add it to your
        # list of email addresses (in self.__email_addresses). For the
        # regexes, http://py-howto.sourceforge.net/regex/regex.html will show
        # you what to do.
        print "Anchor:\thref='%s' name='%s' type='%s'" % (href, name, type)

    def getAddresses(self):
        return self.__email_addresses

parser = EmailParser(formatter.NullFormatter())
parser.feed(your_html_string)
parser.close()
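
The comment block in anchor_bgn describes the missing step: match hrefs that begin with "mailto:", pull out what follows, verify it, and store it. Below is a minimal sketch of that step, not the original author's implementation. It assumes a current Python 3, where htmllib no longer exists, so it swaps in html.parser.HTMLParser and its handle_starttag hook in place of anchor_bgn. The names MailtoParser, MAILTO_RE, ADDRESS_RE and extract_mailto_addresses are hypothetical, and the address check is a deliberately loose stand-in for the Python Cookbook recipe mentioned above.

```python
import re
from html.parser import HTMLParser

# Captures everything after "mailto:" up to an optional "?subject=..." part.
MAILTO_RE = re.compile(r"^mailto:([^?]*)", re.IGNORECASE)

# Loose sanity check only (an assumption, not the Cookbook recipe):
# something@something.something, with no whitespace or extra "@".
ADDRESS_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


class MailtoParser(HTMLParser):
    """Collect email addresses from <a href="mailto:..."> anchors."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = MAILTO_RE.match(href.strip())
        if not match:
            return
        # A mailto: href may list several comma-separated recipients.
        for candidate in match.group(1).split(","):
            candidate = candidate.strip()
            if ADDRESS_RE.match(candidate) and candidate not in self.addresses:
                self.addresses.append(candidate)


def extract_mailto_addresses(html_text):
    """Return the unique mailto: addresses in html_text, in document order."""
    parser = MailtoParser()
    parser.feed(html_text)
    parser.close()
    return parser.addresses
```

Calling extract_mailto_addresses(your_html_string) plays the same role as the feed/close/getAddresses sequence above. Note that this only finds addresses inside mailto: links; addresses that appear as plain text in the page would still need a separate pass.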

"Brad Bollenbach" <bbollenbach at home.com> wrote in message
news:UsvJ7.42389$J62.7116531 at news1.rdc1.mb.home.com...
> "David A McInnis" <david at dataovation.com> wrote in message
> news:mailman.1005960683.29958.python-list at python.org...
> > Ok, I am not very good at regular expressions, so any help is greatly
> > appreciated.
> >
> > Here is the situation.
> >
> > I have a database table that contains a text field.  This text field may
> > or may not contain email addresses.  I know how to read the content of
> > field to
>
> Why "might" it not contain an email address? Specifically, what kind of
> data is in the column? What format? An HTML webpage, XML, etc.? Either
> way, you
> probably don't want to solve this via a regular expression: that takes too
> much effort to get right.
>
> If it's HTML, HTMLParser (in the htmllib module) makes the solution
> trivial.
>
> Hope that helps,
>
> Brad
>
>




