Something faster than sgmllib for sucking out URLs

Fredrik Lundh fredrik at pythonware.com
Thu Jun 13 06:24:18 EDT 2002


Martin v. Loewis wrote:
>
> > I'm working on a webspider to fit my sick needs. The profiler
> > tells me that about 95% of the time is spent in sgmllib. I use sgmllib
> > solely for extracting URLs. I'm looking for a faster way of doing
> > this. Regular expressions, string searches? What's the way to go? I'm
> > not a Python purist. Calling some fast C program with the HTML as
> > argument and getting back a list of URLs would be fine by me.
>
> I recommend using sgmlop, which is distributed both as part of PyXML
> and separately by Fredrik Lundh. It is the fastest SGML/XML parser I
> know of for use within Python.

the latest version (1.1a3) is available here:

    http://effbot.org/downloads/

here's a code snippet that extracts A HREF anchors
from a webpage:

import sgmlop
import urllib

class AnchorHandler:
    # sgmlop calls finish_starttag for each start tag, passing
    # the tag name and the attributes as (name, value) pairs
    def __init__(self):
        self.anchors = []
    def finish_starttag(self, tag, attrs):
        if tag == "a":
            for k, v in attrs:
                if k == "href":
                    self.anchors.append(v)

def getanchors(page):
    handler = AnchorHandler()
    parser = sgmlop.SGMLParser()
    parser.register(handler) # route parser events to our handler
    parser.feed(urllib.urlopen(page).read())
    parser.close() # we're done
    return handler.anchors

print getanchors("http://www.python.org")
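the original question also mentioned regular expressions; for
comparison, here's a minimal regex sketch (the getanchors_re name
and the pattern are illustrative assumptions, not part of sgmlop or
the snippet above; the pattern misses unquoted attribute values and
will happily match hrefs inside comments and scripts):

import re
import urllib

# crude scan: grab single- or double-quoted href values,
# case-insensitively (an assumption, not a full HTML grammar)
href_pattern = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def getanchors_re(page):
    return href_pattern.findall(urllib.urlopen(page).read())

print getanchors_re("http://www.python.org")

the regex trades robustness for simplicity; whether it actually
beats sgmlop on real pages is worth profiling before committing.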

</F>
