Something faster than sgmllib for sucking out URLs
Fredrik Lundh
fredrik at pythonware.com
Thu Jun 13 06:24:18 EDT 2002
Martin v. Loewis wrote:
>
> > I'm working on a webspider to fit my sick needs. The profiler
> > tells me that about 95% of the time is spent in sgmllib. I use sgmllib
> > solely for extracting URLs. I'm looking for a faster way of doing
> > this. Regular expressions, string searches? What's the way to go? I'm
> > not a python purist. Calling some fast C program with the html as
> > argument and getting back a list of URLs would be fine by me.
>
> I recommend using sgmlop, which is distributed both as part of PyXML
> and separately by Fredrik Lundh. It is the fastest SGML/XML parser I
> know of for use within Python.
the latest version (1.1a3) is available here:
http://effbot.org/downloads/
here's a code snippet that extracts A HREF anchors
from a webpage:
import sgmlop
import urllib

class AnchorHandler:

    def __init__(self):
        self.anchors = []

    def finish_starttag(self, tag, attrs):
        # called by sgmlop for each start tag; attrs is a list of (name, value)
        if tag == "a":
            for k, v in attrs:
                if k == "href":
                    self.anchors.append(v)

def getanchors(page):
    handler = AnchorHandler()
    parser = sgmlop.SGMLParser()
    parser.register(handler)
    parser.feed(urllib.urlopen(page).read())
    parser.close() # we're done
    return handler.anchors

print getanchors("http://www.python.org")
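and if sgmlop isn't an option, the regular-expression route the
original poster mentioned could look something like the sketch
below (my own rough version, not part of sgmlop; a regex like this
only handles plain quoted href="..." attributes and will miss
unquoted values, comments, and other malformed markup, so treat it
as a speed hack, not a parser):

```python
import re

# hypothetical helper: grab href values out of <a ...> start tags.
# only matches single- or double-quoted attribute values.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']',
                     re.IGNORECASE)

def getanchors_re(html):
    # return all href values found in the given HTML string
    return HREF_RE.findall(html)
```

usage:

    getanchors_re('<a href="http://example.com/">link</a>')
    # ['http://example.com/']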
</F>