urllib2 and threading

Fri May 1 14:18:10 EDT 2009

On the performance side, lxml easily outperforms Beautiful Soup.

For what it's worth, the code runs fine if you switch from urllib2 to
urllib (different exceptions are raised, obviously). I have no
experience using urllib2 in a threaded environment, so I'm not sure
why it breaks; urllib does OK, though.
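
For reference, here is a rough sketch of the kind of threaded fetch being
discussed. It is not robean's code. It assumes Python 3, where urllib2's
functionality lives in urllib.request, and the names fetch, fetch_all and
fetcher are made up for illustration. Each worker records either the page
body or the exception it hit, so one bad URL does not take down the rest.

```python
import threading
import urllib.request


def fetch(url, timeout=10):
    """Return the body of url as bytes (urllib.request is Python 3's urllib2)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch):
    """Fetch each URL in its own thread; map URL -> bytes or the exception raised."""
    results = {}
    lock = threading.Lock()  # guards writes to the shared results dict

    def worker(url):
        try:
            outcome = fetcher(url)
        except Exception as exc:  # e.g. URLError, HTTPError, socket timeout
            outcome = exc
        with lock:
            results[url] = outcome

    threads = [threading.Thread(target=worker, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every fetch to finish before returning
    return results
```

Calling fetch_all(list_of_urls) would hit the network; passing a different
fetcher makes it easy to try the threading part in isolation. With several
hundred pages you would probably want to cap the number of threads rather
than start one per URL.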

- Shailen

On May 1, 9:29 am, Stefan Behnel <stefan... at behnel.de> wrote:
> robean wrote:
> > I am writing a program that involves visiting several hundred webpages
> > and extracting specific information from the contents. I've written a
> > modest 'test' example here that uses a multi-threaded approach to
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that)
>
> Try lxml.html instead. It often parses HTML pages better than BS, can parse
> directly from HTTP/FTP URLs, frees the GIL doing so, and is generally a lot
> faster and more memory friendly than the combination of urllib2 and BS,
> especially when threading is involved. It also supports CSS selectors for
> finding page content, so your "elaborate scraping" might actually turn out
> to be a lot simpler than you think.
>
> http://codespeak.net/lxml/
>
> These might be worth reading:
>
> http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-sc...
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Stefan