Something faster than sgmllib for sucking out URLs

Thu Jun 13 02:57:08 EDT 2002

Alex Polite <m2 at plusseven.com> writes:

> I'm working on a webspider to fit my sick needs. The profiler
> tells me that about 95% of the time is spent in sgmllib. I use sgmllib
> solely for extracting URLs. I'm looking for a faster way of doing
> this. Regular expressions, string searches? What's the way to go? I'm
> not a python purist. Calling some fast C program with the html as
> argument and getting back a list of URLs would be fine by me.

I recommend using sgmlop, which is distributed both as part of PyXML
and separately by Fredrik Lundh. It is the fastest SGML/XML parser I
know of for use within Python.
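
If an extension module is not an option, the regular-expression route
raised in the question could look roughly like the sketch below. This is
not sgmlop and not a real parser: the pattern and the name extract_urls
are made up for illustration, only quoted href/src values are handled,
and anything inside comments or scripts will be picked up as well.

```python
import re

# Hypothetical pattern: href= or src= followed by a quoted value.
# The backreference \1 makes the closing quote match the opening one.
_URL_ATTR = re.compile(
    r"""\b(?:href|src)\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)

def extract_urls(html):
    """Return quoted href/src attribute values in document order."""
    return [m.group(2) for m in _URL_ATTR.finditer(html)]
```

Calling extract_urls(page) on a string holding the page gives a plain list
of strings. Relative URLs are returned as-is, so a spider would still need
to resolve them against the page's own address.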

Regards,
Martin