Do I have to use threads?

MRAB python at mrabarnett.plus.com
Thu Jan 7 12:38:46 EST 2010


Jorgen Grahn wrote:
> On Thu, 2010-01-07, Marco Salden wrote:
>> On Jan 6, 5:36 am, Philip Semanchuk <phi... at semanchuk.com> wrote:
>>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>>>
>>>> Hello people,
>>>> I have 5 directories corresponding to 5 different URLs. I want to
>>>> download images from those URLs and place them in the respective
>>>> directories. I have to extract the contents and download them
>>>> simultaneously. I can extract the contents and do them one by one.
>>>> My question is: for doing it simultaneously, do I have to use
>>>> threads?
>>> No. You could spawn 5 copies of wget (or curl or a Python program that  
>>> you've written). Whether or not that will perform better or be easier  
>>> to code, debug and maintain depends on the other aspects of your  
>>> program(s).
>>>
>>> bye
>>> Philip
>> Yep, the easier and more straightforward the approach, the better:
>> threads are always (programmers')-error-prone by nature.
>> But my question would be: does it REALLY need to be simultaneous?
>> The CPU/OS only has more overhead doing this in parallel with
>> processes. Measuring sequential processing and then trying to
>> optimize (e.g. for user response or whatever) would be my preferred
>> way to go. Less=More.
> 
> Normally when you do HTTP in parallel over several TCP sockets, it
> has nothing to do with CPU overhead. You just don't want every GET to
> be delayed just because the server(s) are lazy responding to the first
> few ones; or you might want to read the text of a web page and the CSS
> before a few huge pictures have been downloaded.
> 
> His "I have to [do them] simultaneously" makes me want to ask "Why?".
> 
> If he's expecting *many* pictures, I doubt that the parallel download
> will buy him much.  Reusing the same TCP socket for all of them is
> more likely to help, especially if the pictures aren't tiny. One
> long-lived TCP connection is much more efficient than dozens of
> short-lived ones.
> 
> Personally, I'd popen() wget and let it do the job for me.
> 
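(As a rough, untested sketch of the approach Philip and Jorgen suggest,
spawning one wget per URL: the URL list and directory names below are
invented for illustration, and wget's -P option just sets the directory
the files are saved into.)

import subprocess

urls_and_dirs = [
    ("http://example.com/gallery1", "dir1"),
    ("http://example.com/gallery2", "dir2"),
    # ... one pair per directory ...
]

procs = []
for url, directory in urls_and_dirs:
    # Start one wget per URL; -P tells wget which directory to save into.
    procs.append(subprocess.Popen(["wget", "-P", directory, url]))

# Wait for all of the downloads to finish.
for p in procs:
    p.wait()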
From my own experience:

I wanted to download a number of webpages.

I noticed that there was a significant delay before each site would
reply, and an especially long delay for one of them, so I used a number
of threads, each one reading a URL from a queue, performing the
download, and then reading the next URL, until there were none left
(actually, until it read the sentinel None, which it put back for the
other threads).
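In outline, the pattern looked something like this (a minimal sketch,
not the actual script: the URLs and the thread count are invented, and
it uses the Python 2 module names urllib2 and Queue):

import threading
import urllib2
from Queue import Queue  # "queue" in Python 3

def worker(url_queue):
    while True:
        url = url_queue.get()
        if url is None:
            # Sentinel: put it back so the other threads stop too.
            url_queue.put(None)
            break
        data = urllib2.urlopen(url).read()
        # ... write 'data' out to the right file/directory here ...

url_queue = Queue()
for url in ["http://example.com/page1", "http://example.com/page2"]:
    url_queue.put(url)
url_queue.put(None)  # sentinel marking the end of the URLs

threads = [threading.Thread(target=worker, args=(url_queue,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()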

The result?

Shorter total download time, because the script could be downloading one
webpage while still waiting for another site to reply.

(Of course, I had to make sure that I didn't have too many threads,
because that might've put too many demands on the website, not a nice
thing to do!)


