html link parser?

Sun Jul 16 07:03:55 EDT 2000

Paul Johnston wrote:
> 
> I'm looking to parse the links out of an html document.  Is there a module
> analogous to Html::LinkExtor (perl)?
> 
> Thanks for your help,
> Paul
This is part of my LinkChecker (linkchecker.sourceforge.net).
The snippet below does not honor a <base href=...> tag; the complete
source code there shows how that case is handled.

import re

_linkMatcher = r"""
    (?i)          # case insensitive
    <             # open tag
    \s*           # whitespace
    %s            # tag name
    \s+           # whitespace
    [^>]*?        # skip leading attributes
    %s            # attrib name
    \s*           # whitespace
    =             # equal sign
    \s*           # whitespace
    (?P<value>    # attribute value
     ".*?" |      # in double quotes
     '.*?' |      # in single quotes
     [^\s>]+)     # unquoted
    [^>]*         # skip trailing attributes
    >             # close tag
    """
# Note: other link tags work too, e.g. ("img", "src").
pattern = re.compile(_linkMatcher % ("a", "href"), re.VERBOSE)
htmltext = "<html>...</html>"
# finditer yields every non-overlapping match in document order
for match in pattern.finditer(htmltext):
    # the captured value still includes any surrounding quotes
    print("Found url %s" % match.group('value'))
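
For the <base href=...> case mentioned above, here is a minimal sketch of one way to handle it with the same regex approach. It is not the LinkChecker code: the names find_values, extract_links and page_url are made up for illustration, and it assumes Python 3, where urljoin lives in urllib.parse. It also strips the surrounding quotes from each captured value.

```python
import re
from urllib.parse import urljoin

# Condensed form of the verbose pattern from the post; the two %s slots
# take the tag name and the attribute name.
_LINK_MATCHER = r"""
    <\s*%s\s+                          # open tag, tag name
    [^>]*?                             # skip leading attributes
    %s\s*=\s*                          # attribute name, equal sign
    (?P<value>".*?"|'.*?'|[^\s>]+)     # quoted or unquoted value
    [^>]*>                             # skip trailing attributes, close tag
    """


def find_values(html, tag, attr):
    """Return the attr values of every <tag ...> in html, quotes stripped."""
    pattern = re.compile(_LINK_MATCHER % (re.escape(tag), re.escape(attr)),
                         re.IGNORECASE | re.VERBOSE)
    return [m.group("value").strip("\"'") for m in pattern.finditer(html)]


def extract_links(html, page_url):
    """Return absolute URLs for all <a href=...> links in html.

    page_url (hypothetical parameter) is the address the page was fetched
    from; a <base href=...> tag, if present, overrides it as the base.
    """
    bases = find_values(html, "base", "href")
    base = urljoin(page_url, bases[0]) if bases else page_url
    return [urljoin(base, href) for href in find_values(html, "a", "href")]


sample = ('<html><head><base href="http://example.com/docs/"></head>'
          '<body><a href="intro.html">x</a> <A HREF=/top.html>y</A></body></html>')
for url in extract_links(sample, "http://example.com/index.html"):
    print("Found url %s" % url)
```

Like the original snippet, this is still a regex and will be fooled by things such as links inside comments or a ">" inside a quoted value, so treat it as a quick extractor rather than a full HTML parser.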