html link parser?
The Dude
calvin at studcs.uni-sb.de
Sun Jul 16 07:03:55 EDT 2000
Paul Johnston wrote:
>
> I'm looking to parse the links out of an html document. Is there a module
> analogous to HTML::LinkExtor (perl)?
>
> Thanks for your help,
> Paul
This is part of my LinkChecker (linkchecker.sourceforge.net).
The snippet below does not honor a <base href=...> tag; the complete
source code handles that case, so look there if you need it.
import re

_linkMatcher = r"""
    (?i)           # case insensitive
    <              # open tag
    \s*            # whitespace
    %s             # tag name
    \s+            # whitespace
    [^>]*?         # skip leading attributes
    %s             # attrib name
    \s*            # whitespace
    =              # equal sign
    \s*            # whitespace
    (?P<value>     # attribute value
     ".*?" |       # in double quotes
     '.*?' |       # in single quotes
     [^\s>]+)      # unquoted
    [^>]*          # skip trailing attributes
    >              # close tag
"""

# note: you can use other link tags too like "img src=..."
pattern = re.compile(_linkMatcher % ("a", "href"), re.VERBOSE)
htmltext = "<html>...</html>"
index = 0
while 1:
    # search again, starting right after the previous match
    match = pattern.search(htmltext, index)
    if not match:
        break
    index = match.end()
    # the matched value still includes any surrounding quotes
    print("Found url %s" % match.group('value'))
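
For anyone who needs the <base href=...> handling and the "img src" variant
without digging through the full source, here is a rough sketch of how it
could be done with the same expression. This is not LinkChecker's actual
code. The names find_values, extract_links and page_url are made up for
illustration, the inline (?i) is moved into the compile flags, and I am
assuming that resolving every link with urljoin against the first
<base href> (or the page's own URL if there is none) is good enough.

```python
import re
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

# Same expression as in the post; case-insensitivity is passed as a flag.
_LINK_TEMPLATE = r"""
    <              # open tag
    \s*            # whitespace
    %s             # tag name
    \s+            # whitespace
    [^>]*?         # skip leading attributes
    %s             # attribute name
    \s*=\s*        # equal sign with optional whitespace
    (?P<value>     # attribute value
     ".*?" |       # in double quotes
     '.*?' |       # in single quotes
     [^\s>]+)      # unquoted
    [^>]*          # skip trailing attributes
    >              # close tag
"""


def find_values(htmltext, tag, attr):
    """Hypothetical helper: values of `attr` on each `tag`, quotes removed."""
    pattern = re.compile(_LINK_TEMPLATE % (tag, attr),
                         re.IGNORECASE | re.VERBOSE)
    values = []
    for match in pattern.finditer(htmltext):
        value = match.group("value")
        # strip one pair of matching quotes, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values.append(value)
    return values


def extract_links(htmltext, page_url):
    """Hypothetical helper: absolute URLs of all <a href> and <img src>."""
    bases = find_values(htmltext, "base", "href")
    # assumption: only the first <base href> counts
    base = urljoin(page_url, bases[0]) if bases else page_url
    links = (find_values(htmltext, "a", "href") +
             find_values(htmltext, "img", "src"))
    return [urljoin(base, link) for link in links]
```

Usage would be something like extract_links(htmltext, "http://example.com/index.html").
Like the original, this is a regular expression and not a real HTML parser,
so it will also pick up links inside comments or scripts.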