Do I have to use threads?

Tom sharpblade1 at gmail.com
Wed Jan 13 12:09:59 EST 2010


On Jan 7, 5:38 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> Jorgen Grahn wrote:
> > On Thu, 2010-01-07, Marco Salden wrote:
> >> On Jan 6, 5:36 am, Philip Semanchuk <phi... at semanchuk.com> wrote:
> >>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>
> >>>> Hello people,
> >>>> I have 5 directories corresponding to 5 different URLs. I want to
> >>>> download images from those URLs and place them in the respective
> >>>> directories. I have to extract the contents and download them
> >>>> simultaneously. I can extract the contents and do them one by one.
> >>>> My question is: for doing it simultaneously, do I have to use
> >>>> threads?
> >>> No. You could spawn 5 copies of wget (or curl or a Python program that  
> >>> you've written). Whether or not that will perform better or be easier  
> >>> to code, debug and maintain depends on the other aspects of your  
> >>> program(s).
>
> >>> bye
> >>> Philip
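
(For the record, that subprocess route can look roughly like the
untested sketch below. The URL/directory pairs are made up, and the
exact wget flags will depend on how the pages link to their images:)

import subprocess

# Made-up URL/directory pairs; replace with the real five.
jobs = {
    "dir1": "http://example.com/gallery1",
    "dir2": "http://example.com/gallery2",
    "dir3": "http://example.com/gallery3",
    "dir4": "http://example.com/gallery4",
    "dir5": "http://example.com/gallery5",
}

# One wget per URL, each saving into its own directory (-P).
# -r -l1 -A restricts the recursion to images linked from the page.
procs = [
    subprocess.Popen(["wget", "-r", "-l1", "-A", "jpg,jpeg,png,gif",
                      "-P", directory, url])
    for directory, url in jobs.items()
]

# Wait for all five downloads to finish.
for p in procs:
    p.wait()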
> >> Yep, the easier and more straightforward the approach, the better:
> >> threads are always (programmers')-error-prone by nature.
> >> But my question would be: does it REALLY need to be simultaneous?
> >> The CPU/OS only has more overhead doing this in parallel with
> >> processes. Measuring sequential processing and then trying to
> >> optimize (e.g. for user response or whatever) would be my preferred
> >> way to go. Less=More.
>
> > Normally when you do HTTP in parallel over several TCP sockets, it
> > has nothing to do with CPU overhead. You just don't want every GET to
> > be delayed just because the server(s) are lazy responding to the first
> > few ones; or you might want to read the text of a web page and the CSS
> > before a few huge pictures have been downloaded.
>
> > His "I have to [do them] simultaneously" makes me want to ask "Why?".
>
> > If he's expecting *many* pictures, I doubt that the parallel download
> > will buy him much.  Reusing the same TCP socket for all of them is
> > more likely to help, especially if the pictures aren't tiny. One
> > long-lived TCP connection is much more efficient than dozens of
> > short-lived ones.
>
> > Personally, I'd popen() wget and let it do the job for me.
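
(To illustrate that last point: with http.client, httplib in 2.x, you
can fetch several files from the same host over a single keep-alive
connection instead of opening a new socket per request. The host and
paths below are made up, and there's no error handling:)

import http.client

# Made-up host and image paths.
paths = ["/img/a.jpg", "/img/b.jpg", "/img/c.jpg"]

conn = http.client.HTTPConnection("example.com")
try:
    for path in paths:
        conn.request("GET", path)
        resp = conn.getresponse()
        data = resp.read()  # read fully before reusing the connection
        with open(path.rsplit("/", 1)[-1], "wb") as f:
            f.write(data)
finally:
    conn.close()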
>
>  From my own experience:
>
> I wanted to download a number of webpages.
>
> I noticed that there was a significant delay before each server would
> reply, and an especially long delay for one of them, so I used a number
> of threads, each one reading a URL from a queue, performing the
> download, and then reading the next URL, until there were none left
> (actually, until it read the sentinel None, which it put back for the
> other threads).
>
> The result?
>
> Shorter total download time because it could be downloading one webpage
> while waiting for another to reply.
>
> (Of course, I had to make sure that I didn't have too many threads,
> because that might've put too many demands on the website, not a nice
> thing to do!)
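
(That sentinel pattern looks roughly like this; a minimal untested
sketch with placeholder URLs, written with the Python 3 module names,
in 2.x they are Queue and urllib2:)

import threading
import queue
import urllib.request

url_queue = queue.Queue()
for url in ["http://example.com/page1", "http://example.com/page2"]:
    url_queue.put(url)
url_queue.put(None)  # sentinel: nothing left to do

def worker():
    while True:
        url = url_queue.get()
        if url is None:
            url_queue.put(None)  # put it back so the other threads see it too
            return
        data = urllib.request.urlopen(url).read()
        # ... save `data` to the right directory here ...

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()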

A fair few of my scripts require multiple uploads and downloads, and I
always use threads to do so. I was once using an API which was quite
badly designed: I got a list of UserIds from one API call and then had
to query another API method to get info on each of those UserIds. I
could have used Twisted, but in the end I just made a simple thread
pool (30 threads and an in/out Queue). The result? A *massive*
speedup, even with the extra complication of waiting until all the
threads are done and then grouping the results together from the
output Queue.
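
Roughly, that kind of pool can be sketched as below; get_user_info and
the ids are hypothetical stand-ins for the real API calls, and again
these are the Python 3 module names:

import threading
import queue

def get_user_info(user_id):
    ...  # hypothetical stand-in for the second API call

def fetch_all(user_ids, num_threads=30):
    in_q = queue.Queue()
    out_q = queue.Queue()
    for uid in user_ids:
        in_q.put(uid)

    def worker():
        while True:
            try:
                uid = in_q.get_nowait()
            except queue.Empty:
                return  # input queue drained, this worker is done
            out_q.put((uid, get_user_info(uid)))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:  # wait until all the threads are done
        t.join()

    # Group the results together from the output queue.
    results = {}
    while not out_q.empty():
        uid, info = out_q.get()
        results[uid] = info
    return results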

Since then I always use native threads.

Tom


