urllib2 and threading

Stefan Behnel stefan_ml at behnel.de
Fri May 1 12:29:22 EDT 2009


robean wrote:
> I am writing a program that involves visiting several hundred webpages
> and extracting specific information from the contents. I've written a
> modest 'test' example here that uses a multi-threaded approach to
> reach the urls with urllib2. The actual program will involve fairly
> elaborate scraping and parsing (I'm using Beautiful Soup for that)
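
(I'm assuming something roughly like the following for the threaded urllib2
part -- just a sketch, with made-up URLs and thread count, not your actual
test code:)

import urllib2
import threading
import Queue

urls = ["http://example.com/page%d" % i for i in range(10)]

url_queue = Queue.Queue()
for url in urls:
    url_queue.put(url)

results = {}

def worker():
    # pull URLs off the queue until it is empty
    while True:
        try:
            url = url_queue.get_nowait()
        except Queue.Empty:
            return
        try:
            results[url] = urllib2.urlopen(url).read()
        except urllib2.URLError:
            results[url] = None

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()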

Try lxml.html instead. It often parses HTML pages better than BS, can parse
directly from HTTP/FTP URLs, frees the GIL while doing so, and is generally a
lot faster and more memory-friendly than the combination of urllib2 and BS,
especially when threading is involved. It also supports CSS selectors for
finding page content, so your "elaborate scraping" might actually turn out
to be a lot simpler than you think.

http://codespeak.net/lxml/
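
For example, fetching a page and pulling out links could look roughly like
this (a minimal sketch -- the URL and the "div.story a" selector are made up
for illustration):

import lxml.html

# parse() accepts an HTTP/FTP URL directly, no urllib2 required
doc = lxml.html.parse("http://example.com/index.html")
root = doc.getroot()

# turn relative href/src attributes into absolute URLs
root.make_links_absolute("http://example.com/index.html")

# CSS selectors instead of manual tree walking
for link in root.cssselect("div.story a"):
    print link.get("href"), link.text_content().strip()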

These might be worth reading:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan
