[XML-SIG] xml / html parsing for webbot

Bastian Kleineidam calvin@cs.uni-sb.de
Sun, 10 Dec 2000 15:35:14 +0100 (CET)


Hello Kent,

>2. I have thought about not building a DOM tree but using regular expressions
> to extract all links. Can somebody share, from their experience, a
> comparison of the two approaches? Which is better? In particular, I found
> some pages generated by scripts that contain unmatched tags.
> How do the two approaches handle them?

I am using Regexps:
import re

_linkMatcher = r"""
    (?i)           # case insensitive
    <              # open tag
    \s*            # whitespace
    %s             # tag name
    \s+            # whitespace
    [^>]*?         # skip leading attributes
    %s             # attrib name
    \s*            # whitespace
    =              # equal sign
    \s*            # whitespace
    (?P<value>     # attribute value
     ".*?" |       # in double quotes
     '.*?' |       # in single quotes
     [^\s>]+)      # unquoted
    ([^">]|".*?")* # skip trailing attributes
    >              # close tag
    """
# and now fill in some tags:
LinkPatterns = (
    re.compile(_linkMatcher % ("a",     "href"), re.VERBOSE),
    re.compile(_linkMatcher % ("img",   "src"), re.VERBOSE),
    re.compile(_linkMatcher % ("form",  "action"), re.VERBOSE),
    re.compile(_linkMatcher % ("body",  "background"), re.VERBOSE),
    re.compile(_linkMatcher % ("frame", "src"), re.VERBOSE),
    re.compile(_linkMatcher % ("link",  "href"), re.VERBOSE),
    # <meta http-equiv="refresh" content="x; url=...">
    # ("url" is matched inside the content value, not as a real attribute)
    re.compile(_linkMatcher % ("meta",  "url"), re.VERBOSE),
    re.compile(_linkMatcher % ("area",  "href"), re.VERBOSE),
    re.compile(_linkMatcher % ("script", "src"), re.VERBOSE),
)
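
A minimal sketch of how such patterns might be applied to a page. The
helper names make_pattern and find_links and the sample HTML are made up
for illustration and are not taken from Linkchecker; the case-insensitive
flag is passed to re.compile() here instead of being written inline.

```python
import re

# Same pattern as above, minus the inline (?i) flag.
_linkMatcher = r"""
    <              # open tag
    \s*            # whitespace
    %s             # tag name
    \s+            # whitespace
    [^>]*?         # skip leading attributes
    %s             # attrib name
    \s*            # whitespace
    =              # equal sign
    \s*            # whitespace
    (?P<value>     # attribute value
     ".*?" |       # in double quotes
     '.*?' |       # in single quotes
     [^\s>]+)      # unquoted
    ([^">]|".*?")* # skip trailing attributes
    >              # close tag
    """

def make_pattern(tag, attr):
    """Compile the template for one tag/attribute pair (hypothetical helper)."""
    return re.compile(_linkMatcher % (tag, attr), re.IGNORECASE | re.VERBOSE)

def find_links(html, patterns):
    """Return the raw attribute values matched by any of the patterns."""
    found = []
    for pattern in patterns:
        for match in pattern.finditer(html):
            # The named group still includes any surrounding quotes.
            found.append(match.group("value"))
    return found

patterns = (make_pattern("a", "href"), make_pattern("img", "src"))
html = '<p><A HREF="index.html">home</A> <img alt="logo" src=logo.png></p>'
print(find_links(html, patterns))
```

Note that the values come back exactly as written in the page, quotes
included, so they still need cleaning up before use.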

This regex even copes with a missing quote, as in:
<a href="bla>
<a href=bla">

But that only works if you strip leading and trailing quotes from the
matched value afterwards, because the quote characters are part of the match.
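
The stripping step could look roughly like this. It is only a sketch:
the function name strip_quotes is an assumption for illustration, and the
real Linkchecker code may do it differently. The sample strings are the
kinds of raw values the pattern captures for well-formed and broken tags.

```python
def strip_quotes(value):
    """Remove any leading and trailing quote characters from a raw value."""
    return value.strip("\"'")

# Properly quoted, missing closing quote, missing opening quote, unquoted.
raw_values = ['"bla"', "'bla'", '"bla', 'bla"', 'bla']
for raw in raw_values:
    print(repr(raw), "->", repr(strip_quotes(raw)))
```

All five inputs end up as the same URL, which is why the broken tags can
be treated like the correct ones.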
For a complete code example, get Linkchecker from
http://linkchecker.sourceforge.net
and look in linkcheck/UrlData.py.

Bastian