[Tutor] Threads

orbitz orbitz at ezabel.com
Wed Nov 17 03:06:44 CET 2004


My apologies: the shutdown check in _handlePage should be len(URLS), not len(URLS) - 1.

orbitz wrote:

> Not only are things like waiting for headers a major issue; so are 
> simply resolving the name and connecting.  What if your DNS goes down 
> mid-download?  It could take a long time to time out while trying to 
> connect to your DNS server, and none of your sockets will be touched, 
> select or not.  So if we are going to use blocking sockets, we might 
> as well go all the way.
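>
> For instance (a hypothetical snippet, not from the original post), the
> name lookup happens before there is any socket for select() to watch,
> so a dead resolver stalls everything:
>
> import socket
> import time
>
> start = time.time()
> try:
>     ## gethostbyname() blocks inside the C resolver; select() never
>     ## gets a chance to help here.
>     socket.gethostbyname('www.google.com')
> except socket.error, e:
>     print 'lookup failed:', e
> print 'lookup took %.2f seconds' % (time.time() - start)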
>
> Here is a simple Twisted example that downloads 3 sites, prints them 
> to stdout, and exits.  It probably won't make much sense yet, but at 
> least it's 100% non-blocking :)
>
> from twisted.web import client
> from twisted.internet import reactor
>
> from urllib2 import urlparse
>
> def _handlePage(result):
>     """The result is the contents of the web page."""
>     global num_downloaded
>     print result
>     num_downloaded += 1
>     ## Stop the reactor once every URL has been handled.
>     if num_downloaded == len(URLS):
>         reactor.stop()
>
> URLS = ['http://www.google.com/', 'http://www.yahoo.com/',
>         'http://www.python.org/']
> num_downloaded = 0
>
> for i in URLS:
>     parsed = urlparse.urlsplit(i)
>     ## Hand the factory just the path; set its host attribute and make
>     ## the TCP connection ourselves.
>     f = client.HTTPClientFactory(parsed[2])
>     f.host = parsed[1]
>     f.deferred.addCallback(_handlePage)
>     reactor.connectTCP(parsed[1], 80, f)
>
> reactor.run()
>
>
> All this does is download each page, print it out, and, once that many 
> URLs have been processed, stop the program (reactor.stop()).  This does 
> not handle errors or any exceptional situations.
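>
> As a rough sketch (hypothetical addition, untested) of what error
> handling could look like, an errback attached next to the callback
> lets a failed download still count toward shutdown:
>
> def _handleError(failure):
>     """Called with a Failure object if a download goes wrong."""
>     global num_downloaded
>     print 'error:', failure.getErrorMessage()
>     num_downloaded += 1
>     if num_downloaded == len(URLS):
>         reactor.stop()
>
> ## ...and inside the for loop, right after addCallback:
> ## f.deferred.addErrback(_handleError)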
>
> Danny Yoo wrote:
>
>> On Tue, 16 Nov 2004, orbitz wrote:
>>
>>  
>>
>>> urllib is blocking, so you can't really use it with non-blocking code.
>>> The urlopen function could take a while, and then, even if data is on
>>> the socket, it will most likely still block for the read, which is not
>>> going to help you. One is going to have to use a non-blocking URL API
>>> in order to make the most of their time.
>>>   
>>
>>
>>
>> Hi Orbitz,
>>
>>
>> Hmmm!  Yes, you're right: the sockets block by default.  But when we
>> try to read() a block of data, select() can tell us which ones will
>> immediately block and which ones won't.
>>
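>> For example (a hypothetical snippet, not part of the original thread),
>> select() with a timeout of 0 acts as a poll and returns only the
>> sockets that can be read right now:
>>
>> ###
>> import select
>> import socket
>>
>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>> s.connect(('www.google.com', 80))
>> s.send('GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n')
>>
>> ## Ask which of our sockets are readable, without waiting.
>> readable, writable, in_error = select.select([s], [], [], 0)
>> if s in readable:
>>     print s.recv(4096)    ## guaranteed not to block
>> ###
>>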
>>
>> The real-world situation is actually a bit complicated.  Let's do a test
>> to make things more explicit and measurable.
>>
>>
>> For this example, let's say that we have the following 'hello.py' CGI:
>>
>> ###
>> #!/usr/bin/python
>> import time
>> import sys
>> print "Content-type: text/plain\n\n"
>> sys.stdout.flush()    ## push the headers out right away
>>
>> print "hello world"
>> time.sleep(5)
>> print "goodbye world"
>> ###
>>
>>
>> I'll be accessing this CGI from the URL 
>> "http://localhost/~dyoo/hello.py".
>> I'm also using Apache 2.0 as my web server.  Big note: there's a flush()
>> after the Content-type header.  This is intentional, and will be
>> significant later on in this post.
>>
>>
>>
>> I then wrote the following two test programs:
>>
>> ###
>> ## test1.py
>> from grab_pages import PageGrabber
>> from StringIO import StringIO
>> pg = PageGrabber()
>> f1, f2, f3 = StringIO(), StringIO(), StringIO()
>> pg.add("http://localhost/~dyoo/hello.py", f1)
>> pg.add("http://localhost/~dyoo/hello.py", f2)
>> pg.add("http://localhost/~dyoo/hello.py", f3)
>> pg.writeOutAllPages()
>> print f1.getvalue()
>> print f2.getvalue()
>> print f3.getvalue()
>> ###
>>
>>
>> ###
>> ## test2.py
>> import urllib
>> print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>> print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>> print urllib.urlopen("http://localhost/~dyoo/hello.py").read()
>> ###
>>
>>
>> test1 uses the PageGrabber class we wrote earlier, and test2 uses a
>> straightforward serial approach.
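>>
>> Since the earlier message that defined PageGrabber isn't quoted here,
>> here's a rough reconstruction (entirely hypothetical, assuming a
>> select() loop over plain HTTP/1.0 sockets) of what grab_pages.py
>> might look like:
>>
>> ###
>> ## grab_pages.py (sketch; writes the raw response, headers included)
>> import select
>> import socket
>> from urllib2 import urlparse
>>
>> class PageGrabber:
>>     def __init__(self):
>>         self.pending = {}    ## maps socket -> output file
>>
>>     def add(self, url, out):
>>         """Connect, send a request, and remember where the reply goes."""
>>         parsed = urlparse.urlsplit(url)
>>         s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>         s.connect((parsed[1], 80))    ## note: connect() still blocks
>>         s.send("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n"
>>                % (parsed[2] or '/', parsed[1]))
>>         self.pending[s] = out
>>
>>     def writeOutAllPages(self):
>>         """select() over every socket, writing data as it arrives."""
>>         while self.pending:
>>             readable, _, _ = select.select(self.pending.keys(), [], [])
>>             for s in readable:
>>                 data = s.recv(4096)
>>                 if data:
>>                     self.pending[s].write(data)
>>                 else:
>>                     s.close()
>>                     del self.pending[s]
>> ###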
>>
>>
>> If we start timing the performance of test1.py and test2.py, we do see a
>> difference between the two, since test1 will try to grab the pages in
>> parallel, while test2 will do it serially:
>>
>>
>> ###
>> [dyoo at shoebox dyoo]$ time python test1.py
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> real    0m5.106s
>> user    0m0.043s
>> sys    0m0.011s
>>
>> [dyoo at shoebox dyoo]$ time python test2.py
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> real    0m15.107s
>> user    0m0.044s
>> sys    0m0.007s
>> ###
>>
>>
>> So for this particular example, we're getting good results: test1 takes
>> about 5 seconds, while test2 takes 15.  So the select() code is doing
>> pretty ok so far, and does show improvement over the straightforward
>> approach.  Isn't this wonderful?  *grin*
>>
>>
>> Well, there's bad news.
>>
>>
>> The problem is that, as you highlighted, the urllib.urlopen() function
>> itself can block, and that's actually a very bad problem in 
>> practice.  In
>> particular, it blocks until it sees the end of the HTTP headers, 
>> since it
>> depends on Python's 'httplib' module.
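>>
>> A quick way to watch this (hypothetical timing snippet, not one of the
>> tests above) is to time how long urlopen() itself takes to return,
>> before any read() happens:
>>
>> ###
>> import time
>> import urllib
>>
>> start = time.time()
>> f = urllib.urlopen("http://localhost/~dyoo/hello.py")
>> ## With the flush() in place this prints about 0 seconds; without it,
>> ## urlopen() doesn't return until the CGI finishes, about 5 seconds.
>> print "urlopen() returned after %.1f seconds" % (time.time() - start)
>> print f.read()
>> ###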
>>
>> If we take the flush() out of our hello.py CGI:
>>
>> ###
>> #!/usr/bin/python
>> import time
>> import sys
>> print "Content-type: text/plain\n\n"
>> print "hello world";
>> time.sleep(5)
>> print "goodbye world"
>> ###
>>
>>
>> then suddenly things go horribly awry:
>>
>> ###
>> [dyoo at shoebox dyoo]$ time python test1.py
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> hello world
>> goodbye world
>>
>>
>> real    0m15.113s
>> user    0m0.047s
>> sys    0m0.006s
>> ###
>>
>> And suddenly, we do no better than with the serial version!
>>
>>
>> What's happening is that the web server is buffering the output of 
>> its CGI
>> programs.  Without the sys.stdout.flush(), it's likely that the web 
>> server
>> doesn't send out anything until the whole program is complete.  But
>> because urllib.urlopen() returns only after seeing the header block from
>> the HTTP response, it actually ends up waiting until the whole program's
>> done.
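>>
>> To watch the buffering directly (hypothetical snippet, bypassing
>> httplib entirely), a raw socket shows exactly when the server's bytes
>> arrive:
>>
>> ###
>> import socket
>> import time
>>
>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>> s.connect(('localhost', 80))
>> s.send('GET /~dyoo/hello.py HTTP/1.0\r\nHost: localhost\r\n\r\n')
>> start = time.time()
>> while 1:
>>     data = s.recv(4096)
>>     if not data:
>>         break
>>     ## With the flush(), the headers arrive almost immediately;
>>     ## without it, nothing shows up until the CGI exits.
>>     print '%.1fs: got %d bytes' % (time.time() - start, len(data))
>> ###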
>>
>>
>> Not all CGIs have been carefully written to output their HTTP headers
>> in a timely manner, so urllib.urlopen()'s blocking behavior is a
>> show-stopper.
>> This highlights the need for a framework that's built with nonblocking,
>> event-driven code as a pervasive concept.  Like... Twisted!  *grin*
>>
>> Does anyone want to cook up an example with Twisted to show how the
>> page-grabbing example might work?
>>
>>
>>
>> Hope this helps!
>>
>>
>>  
>>
>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
>


