urllib2 and threading

robean st1999 at gmail.com
Fri May 1 11:09:06 EDT 2009


Thanks for your reply. Obviously you make several good points about
Beautiful Soup and Queue. But here's the problem: even if I do nothing
whatsoever with the threads beyond just visiting the urls with
urllib2, the program chokes. If I replace

  else:
    ulock.acquire()
    print page.geturl() # obviously, do something more useful here, eventually
    page.close()
    ulock.release()

with

  else:
    pass

urllib2 starts raising URLErrors after the first 3 - 5 urls have
been visited. Do you have any sense of what in the threads is
corrupting urllib2's behavior?
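
For reference, here is a stripped-down sketch of the overall pattern
(the url list is a placeholder; my actual script does more per page):

  import threading
  import urllib2

  urls = ['http://example.com/'] * 10   # placeholder url list
  ulock = threading.Lock()

  def visit(url):
      try:
          page = urllib2.urlopen(url)
      except urllib2.URLError:
          pass   # this branch starts firing after the first few urls
      else:
          ulock.acquire()
          print page.geturl()   # obviously, do something more useful
          page.close()
          ulock.release()

  threads = [threading.Thread(target=visit, args=(url,)) for url in urls]
  for t in threads:
      t.start()
  for t in threads:
      t.join()

Many thanks,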

Robean



On May 1, 12:27 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> robean <st1... at gmail.com> writes:
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> > the example shown here is simplified and just confirms the url of the
> > site visited.
>
> Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
> of pages and have multiple CPUs, you probably want parallel processes
> rather than threads.
>
> > wrong? I am new to both threading and urllib2, so it's possible that
> > the SNAFU is quite obvious...
> > ...
> > ulock = threading.Lock()
>
> Without looking at the code for more than a few seconds, using an
> explicit lock like that is generally not a good sign.  The usual
> Python style is to send all inter-thread communications through
> Queues.  You'd dump all your url's into a queue and have a bunch of
> worker threads getting items off the queue and processing them.  This
> really avoids a lot of lock-related headache.  The price is that you
> sometimes use more threads than strictly necessary.  Unless it's a LOT
> of extra threads, it's usually not worth the hassle of messing with
> locks.
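
To make sure I follow, here is my reading of the Queue-based worker
pattern you describe (the url list and worker count are placeholders):

  import threading
  import urllib2
  import Queue

  urls = ['http://example.com/'] * 10   # placeholder url list
  url_queue = Queue.Queue()
  result_queue = Queue.Queue()
  for url in urls:
      url_queue.put(url)

  def worker():
      # pull urls off the queue until it is drained; no explicit
      # locks needed, since Queue handles the synchronization
      while True:
          try:
              url = url_queue.get_nowait()
          except Queue.Empty:
              return
          try:
              page = urllib2.urlopen(url)
              result_queue.put(page.geturl())  # real code would scrape/parse here
              page.close()
          except urllib2.URLError, e:
              result_queue.put('failed: %s (%s)' % (url, e))

  workers = [threading.Thread(target=worker) for _ in range(5)]
  for t in workers:
      t.start()
  for t in workers:
      t.join()
  while not result_queue.empty():
      print result_queue.get()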
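
And my reading of the process-based alternative for the Beautiful Soup
work, using the stdlib multiprocessing module (Python 2.6+; fetch and
the url list are placeholders):

  import multiprocessing
  import urllib2

  def fetch(url):
      # each call runs in its own process, so CPU-heavy parsing
      # (e.g. Beautiful Soup) can use more than one core
      try:
          return urllib2.urlopen(url).read()
      except urllib2.URLError:
          return None

  if __name__ == '__main__':
      urls = ['http://example.com/'] * 10   # placeholder url list
      pool = multiprocessing.Pool(processes=4)
      pages = pool.map(fetch, urls)
      pool.close()
      pool.join()
      print '%d of %d pages fetched' % (
          len([p for p in pages if p is not None]), len(urls))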



