urllib2 and threading

Fri May 1 03:27:01 EDT 2009

robean <st1999 at gmail.com> writes:
> reach the urls with urllib2. The actual program will involve fairly
> elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> the example shown here is simplified and just confirms the url of the
> site visited.

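Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
of pages and have multiple CPUs, you probably want parallel processes
rather than threads.  CPython threads don't run Python code on more
than one CPU at a time, so they help with waiting on the network but
not with the parsing itself.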
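
To make the "processes rather than threads" idea concrete, here is a
rough sketch using the standard multiprocessing module, written for
Python 3.  The names count_links and parse_all are made up for
illustration, and the regex is only a cheap stand-in for the real
Beautiful Soup parsing; it assumes the pages have already been
downloaded into strings.

```python
import multiprocessing
import re


def count_links(html):
    # Hypothetical stand-in for the CPU-heavy parsing step.
    # The real program would run Beautiful Soup here instead.
    return len(re.findall(r"<a\s", html))


def parse_all(pages, processes=None):
    # Farm the pages out to a pool of worker processes.
    # processes=None lets the pool pick one worker per CPU.
    with multiprocessing.Pool(processes) as pool:
        # map() keeps results in the same order as the input pages.
        return pool.map(count_links, pages)
```

Call parse_all(pages) from under an `if __name__ == "__main__":`
guard, since some platforms re-import the main module in each worker
process.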

> wrong? I am new to both threading and urllib2, so its possible that
> the SNAFU is quite obvious..
> ...
> ulock = threading.Lock()

Without looking at the code for more than a few seconds, using an
explicit lock like that is generally not a good sign.  The usual
Python style is to send all inter-thread communication through
Queues.  You'd dump all your URLs into a queue and have a bunch of
worker threads getting items off the queue and processing them.  This
avoids a lot of lock-related headaches.  The price is that you
sometimes use more threads than strictly necessary.  Unless it's a LOT
of extra threads, that's usually a better deal than the hassle of
messing with locks.
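
A minimal sketch of that queue-and-workers layout follows.  It is
written for Python 3, where the Queue module is spelled queue (and
urllib2 became urllib.request).  The names fetch, worker, and run are
made up for illustration, and fetch is just a placeholder that does no
network access; the real download and parsing would go there.

```python
import queue
import threading


def fetch(url):
    # Hypothetical placeholder for the real work: download the page
    # and parse it.  Here it only reports which url it was given.
    return "visited " + url


def worker(tasks, results, handler):
    # Pull urls off the task queue until a None sentinel arrives.
    while True:
        url = tasks.get()
        if url is None:
            break
        try:
            results.put((url, handler(url)))
        except Exception as exc:
            # Pass the failure back instead of killing the thread.
            results.put((url, exc))


def run(urls, handler=fetch, num_workers=4):
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, handler))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    # One sentinel per worker so every thread gets told to stop.
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    # All workers have finished, so draining the queue here is safe.
    out = {}
    while not results.empty():
        url, value = results.get()
        out[url] = value
    return out
```

Note there is no explicit Lock anywhere: the two Queue objects do all
the synchronization, and the main thread is the only one that touches
the final dict.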