urllib timeout issues

Nick Vatamaniuc vatamane at gmail.com
Tue Mar 27 23:11:15 EDT 2007


On Mar 27, 4:41 pm, "supercooper" <supercoo... at gmail.com> wrote:
> On Mar 27, 3:13 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
> wrote:
>
> > En Tue, 27 Mar 2007 16:21:55 -0300, supercooper <supercoo... at gmail.com>
> > escribió:
>
> > > I am downloading images using the script below. Sometimes it will go
> > > for 10 mins, sometimes 2 hours before timing out with the following
> > > error:
>
> > >     urllib.urlretrieve(fullurl, localfile)
> > > IOError: [Errno socket error] (10060, 'Operation timed out')
>
> > > I have searched this forum extensively and tried to avoid timing out,
> > > but to no avail. Anyone have any ideas as to why I keep getting a
> > > timeout? I thought setting the socket timeout did it, but it didn't.
>
> > You should do the opposite: time out *early* (not waiting 2 hours) and
> > handle the error (maybe using a queue to hold pending requests)
>
> > --
> > Gabriel Genellina
>
> Gabriel, thanks for the input. So are you saying there is no way to
> realistically *prevent* the timeout from occurring in the first
> place?  And by timing out early, do you mean to set the timeout for x
> seconds and if and when the timeout occurs, handle the error and start
> the process again somehow on the pending requests?  Thanks.
>
> chad

Chad,

Just run the retrieval in a Thread. If the thread is not done after x
seconds, treat it as a timeout and then retry, ignore, quit, or do
anything else you want.

Even better, what I did for my program was first gather all the URLs
(I assume you can do that), then group them by server, i.e. n images
from foo.com, m images from bar.org, and so on. Then start a thread
for each server (with some possible maximum number of threads); each
of those threads is responsible for retrieving images from only one
server (this avoids hammering any single host in a DoS-like pattern).
Let each of the server threads start a 'small' retriever thread for
each image (this handles the timeout you mention).
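Roughly like this, reusing fetch_with_timeout from the sketch above
(group_by_server, server_worker, and the sample URL list are made-up
names, and taking the last path component as the local filename is
just a placeholder):

    import threading
    import urlparse    # urllib.parse in Python 3

    def group_by_server(urls):
        # Map each hostname to the list of URLs it serves.
        servers = {}
        for url in urls:
            host = urlparse.urlparse(url)[1]    # the netloc part
            servers.setdefault(host, []).append(url)
        return servers

    def server_worker(urls):
        # One thread per server: fetch its images one at a time,
        # each through the timeout wrapper above.
        for url in urls:
            localfile = url.split('/')[-1]
            if not fetch_with_timeout(url, localfile):
                print 'timed out:', url    # or queue it for a retry

    all_urls = ['http://foo.com/a.jpg',    # your gathered URLs here
                'http://bar.org/b.jpg']
    server_threads = [threading.Thread(target=server_worker, args=(u,))
                      for u in group_by_server(all_urls).values()]

Note the threads are only created here, not started; starting them in
batches is sketched below.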

So you have two kinds of threads: one per server to parallelize
downloading, each of which in turn spawns one thread per download to
handle the timeout. This way you will (ideally) saturate your
bandwidth, but you only fetch one image per server at a time, so you
still 'play nice' with each of the servers. If you want a maximum
number of server threads running (in case you have way too many
servers to deal with), run the server threads in batches.
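A simple way to do the batching (the batch size of 8 is arbitrary):

    def run_in_batches(threads, batch_size=8):
        # Start batch_size server threads, wait for all of them to
        # finish, then move on to the next batch.
        for i in range(0, len(threads), batch_size):
            batch = threads[i:i + batch_size]
            for t in batch:
                t.start()
            for t in batch:
                t.join()

    run_in_batches(server_threads)

A batch only finishes when its slowest server does, so a worker pool
fed from a Queue (along the lines Gabriel hinted at) would keep the
pipe fuller, but batches are the easiest thing to bolt on.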

Hope this helps,
Nick Vatamaniuc