[Chicago] how to use multithreading to download?

Matt Bone thatmattbone at gmail.com
Sat Jun 18 14:35:16 CEST 2011


You may be interested in this PyCon talk, which discusses how to
combine asyncore and httplib:
http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-backup-is-hard-let-s-go-shopping-4897842

The relevant bits start at ~11 minutes in.
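
For a rough flavor of that event-driven style, here's a minimal,
untested sketch using only asyncore from the stdlib (the page names
are made up, and it assumes /tmp/python exists; the talk discusses
combining this with httplib, which I skip here by splitting off the
headers by hand):

import asyncore
import socket

class Fetcher(asyncore.dispatcher):
    # one non-blocking HTTP/1.0 GET; asyncore.loop() drives many at once
    def __init__(self, host, path, out_path):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, 80))
        self.request = 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)
        self.out_path = out_path
        self.chunks = []

    def handle_connect(self):
        pass

    def writable(self):
        # stay write-ready only until the request has been sent
        return bool(self.request)

    def handle_write(self):
        sent = self.send(self.request)
        self.request = self.request[sent:]

    def handle_read(self):
        self.chunks.append(self.recv(8192))

    def handle_close(self):
        self.close()
        # crude: everything after the first blank line is the body
        body = ''.join(self.chunks).split('\r\n\r\n', 1)[-1]
        open(self.out_path, 'wb').write(body)

for name in ['index.html', 'intro.html']:   # made-up page names
    Fetcher('www.network-theory.co.uk', '/docs/pytut/' + name,
            '/tmp/python/' + name)
asyncore.loop()   # one thread, all sockets multiplexed via select()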

--matt

On Fri, Jun 17, 2011 at 10:15 AM, Dale Sedivec <dale at codefu.org> wrote:
> 2011/6/17 守株待兔 <1248283536 at qq.com>:
>> I have written a program to download an online book:
>> http://www.network-theory.co.uk/docs/pytut/
>>
>> import os
>> import time
>> import urllib
>> import lxml.html
>>
>> time1 = time.time()
>> os.mkdir('/tmp/python')
>> down = 'http://www.network-theory.co.uk/docs/pytut/'
>> # fetch the index page and pull out the chapter links
>> index = urllib.urlopen(down).read()
>> root = lxml.html.fromstring(index)
>> tnodes = root.xpath("//div[@class='main']//ul/li/a")
>> for x in tnodes:
>>     url = down + x.get('href')
>>     name = x.text
>>     # 'wb' instead of 'a': overwrite rather than append on re-runs
>>     myfile = open('/tmp/python/' + name, 'wb')
>>     page = urllib.urlopen(url).read()
>>     myfile.write(page)
>>     myfile.close()
>> time2 = time.time()
>> print time2 - time1
>>
>> It's slow.  Would you mind revising it to use multithreading?
>
> Are you sure that the person running this site would welcome lots of
> parallel hits coming from you to download a book they're giving away?
> My initial reaction is that, as a matter of politeness, you should
> not parallelize this task.  Your bottleneck here is almost certainly
> HTTP request/response latency; there's nothing especially CPU- or
> I/O-intensive on your side.  I'd be surprised if there are more than
> 150 links on that page, so it can't take _that_ long to download them
> sequentially.
>
> Approaching this solely as a hypothetical exercise in parallel
> processing with Python, I think I'd use something like
> multiprocessing.Pool from the standard library (Python 2.6 or later):
> probably Pool.map calling a tiny function that fetches and stores
> each URL (i.e. most of the inside of that loop), maybe with a
> smallish chunksize.  Note that this will actually use separate
> processes, not threads, but I don't see how that would matter in this
> case.
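>
> Roughly, something like this untested sketch (the worker count,
> chunksize, and helper name are just illustrative guesses; the scrape
> mirrors your original script):
>
> import os
> import urllib
> import lxml.html
> from multiprocessing import Pool
>
> BASE = 'http://www.network-theory.co.uk/docs/pytut/'
>
> def fetch(job):
>     # runs in a worker process: download one page and save it
>     url, name = job
>     page = urllib.urlopen(url).read()
>     open(os.path.join('/tmp/python', name), 'wb').write(page)
>
> if __name__ == '__main__':
>     os.mkdir('/tmp/python')
>     root = lxml.html.fromstring(urllib.urlopen(BASE).read())
>     jobs = [(BASE + a.get('href'), a.text)
>             for a in root.xpath("//div[@class='main']//ul/li/a")]
>     pool = Pool(4)                      # a handful of worker processes
>     pool.map(fetch, jobs, chunksize=2)  # smallish chunksize, as above
>     pool.close()
>     pool.join()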
>
> But please don't use this knowledge to download this book in parallel
> unless you know the people that run that site wouldn't mind.
>
> Dale

