Something faster than sgmllib for sucking out URLs
brueckd at tbye.com
Wed Jun 12 18:15:45 EDT 2002
On Wed, 12 Jun 2002, Alex Polite wrote:
> I'm working on a webspider to fit my sick needs. The profiler
> tells me that about 95% of the time is spent in sgmllib. I use sgmllib
> solely for extracting URLs. I'm looking for a faster way of doing
> this. Regular expressions, string searches? What's the way to go? I'm
> not a python purist. Calling some fast C program with the html as
> argument and getting back a list of URLs would be fine by me.
How fast do you need it to be? Using regular expressions + other junk (see
below) I get about 1 MB/sec on a 900 MHz Pentium III - that's a lot of
HTML, and probably faster than you'll be reading it off the 'net. You can
probably simplify it, too: I needed to get nearly all URLs (not just
hrefs), I specifically ignore images, and I uglified it in a few more
places for speed reasons.
The code below returns a list of start/end URL indices, but you can easily
modify it to return the actual URLs (make sure you slice them out of the
original buffer: for speed, this function lowercases the entire web page
first, and some web servers are case-sensitive). Disclaimer: I've added to
this function over time as I've encountered weird cases, but that
obviously doesn't mean it now handles all weird cases. :)
-Dave
import re

# Note that group(1) of validTagRE tells which tag we matched
validTagRE = re.compile(r'<(a|embed|object|param)\s.*?>', re.DOTALL)
urlStartRE = re.compile(r'(href|src|codebase|value)\s*=\s*[\'"]?\s*')
urlEndRE = re.compile(r'\s+|>')

def FindLinks(data):
    'Returns a list of (start, end) url indices'
    data = data.lower() # much faster than using re.I
    cur = 0
    links = []
    while 1:
        # Locate a tag we're interested in
        tagMatch = validTagRE.search(data, cur)
        if not tagMatch:
            break
        tagEnd = tagMatch.end()
        cur = tagEnd + 1
        # Now search for attributes within that tag that contain urls
        urlCur = tagMatch.end(1) + 1 # e.g. for '<embed ...', start=7
        while 1:
            urlMatch = urlStartRE.search(data, urlCur, tagEnd)
            if not urlMatch:
                break
            # Find the end of the url - extra work required because we
            # find HTML like href=foo, href="foo", href=" foo", href='foo',
            # etc.
            urlStart = urlMatch.end()
            urlEnd = -1
            lastChar = urlMatch.group().strip()[-1]
            if lastChar == '=':
                # URL not quoted, end is whitespace or '>'
                endMatch = urlEndRE.search(data, urlStart, tagEnd)
                if not endMatch:
                    break # Bleh! Bad HTML!
                urlEnd = endMatch.start()
                if urlEnd > tagEnd:
                    break # More bad HTML!
                # One last check: in Javascript, might have escaped quotes
                if data[urlStart:urlStart+2] == '\\"' and \
                   data[urlEnd-2:urlEnd] == '\\"':
                    urlStart += 2
                    urlEnd -= 2
            elif lastChar == '"':
                # End will be a double quote
                if data[urlStart] != "'": # Make sure not part of Javascript
                    urlEnd = data.find('"', urlStart)
            elif lastChar == "'":
                # End will be a single quote
                if data[urlStart] != '"': # Make sure not part of Javascript
                    urlEnd = data.find("'", urlStart)
            if urlEnd != -1:
                # Skip not-really-URLs
                ok = 1
                ch = data[urlStart]
                if ch == 'j' or ch == 'm':
                    if data[urlStart:urlStart+11] == 'javascript:' or \
                       data[urlStart:urlStart+7] == 'mailto:':
                        ok = 0
                if ok:
                    links.append((urlStart, urlEnd))
            else:
                urlEnd = tagEnd
            urlCur = urlEnd + 1
            if urlCur >= tagEnd:
                break
    return links
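To illustrate the slice-from-the-original-buffer point above, here is a much-simplified, self-contained sketch (my own illustration, not Dave's function: `find_links_simple` and its regex are assumptions, and it handles only double-quoted href/src attributes). It searches the lowercased copy but slices the matched spans out of the original page, so URL case is preserved:

```python
import re

# Hypothetical simplified matcher: double-quoted href/src values only
attrRE = re.compile(r'(?:href|src)\s*=\s*"([^"]*)"')

def find_links_simple(page):
    # Lowercase once so attribute names match regardless of case...
    data = page.lower()
    # ...but slice the URL text out of the ORIGINAL buffer, since the
    # path portion of a URL can be case-sensitive on some servers.
    return [page[m.start(1):m.end(1)] for m in attrRE.finditer(data)]
```

For example, `find_links_simple('<A HREF="http://Example.COM/X">')` returns `['http://Example.COM/X']`, with the original case intact even though the `HREF` was matched in lowercase.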