How to set a timeout?

Steve Holden sholden at holdenweb.com
Wed Jul 18 10:19:51 EDT 2001


"Matthias Huening" <mhuening at zedat.fu-berlin.de> wrote in message
news:9j446n$lom9k$1 at fu-berlin.de...
> I have a function like this:
>
> -----
> def read_pages(urllist):
>     res = {}
>     for x in urllist:
>         try:
>             res[x] = urllib.urlopen(x).read()
>         except:
>             pass
>     return(res)
> -----
>
> I now want the following behaviour: try to fetch the webpage for 3
> seconds; if it takes longer just skip this one and move on to the next.
> How can I achieve this?
>
> Thanks, Matthias
>
> PS Other suggestions for speeding up the process of reading lets say 100
> webpages are welcome...
>
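To answer the question as asked first: on Unix you can impose a hard
per-URL timeout with signal.alarm(). This is only a sketch (untested, the
names are mine, and it won't work on Windows), but it goes roughly like:

import signal
import urllib

class Timeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise Timeout

def read_pages(urllist, timeout=3):
    res = {}
    signal.signal(signal.SIGALRM, _alarm_handler)
    for x in urllist:
        signal.alarm(timeout)           # arm a 3-second alarm for this URL
        try:
            try:
                res[x] = urllib.urlopen(x).read()
            except Timeout:
                pass                    # too slow -- skip it and move on
            except IOError:
                pass                    # unreachable host, bad URL, etc.
        finally:
            signal.alarm(0)             # always disarm the alarm again
    return res

(There is also Tim O'Malley's timeoutsocket module floating around, which
patches the socket module to give much the same effect without signals.)
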
This one comes in the category of "other suggestions", but it might nullify
your concerns about download time, thereby solving your whole problem. Sam
Rushing, the author of the asyncore module, shows on his web site
(www.nightmare.com) how to asynchronously read as many web pages as you
like. The code goes something like this:

import asyncore
import socket

class http_client (asyncore.dispatcher):

    def __init__ (self, host, path, cnum):
        asyncore.dispatcher.__init__ (self)
        self.path = path
        self.cnum = cnum
        self.host = host
        self.wflag = 1
        self.create_socket (socket.AF_INET, socket.SOCK_STREAM)
        self.connect ((self.host, 80))

    def handle_connect (self):
        # include a Host header so virtually-hosted sites answer correctly
        self.send ('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (self.path, self.host))
        print "Channel:", self.cnum, \
                "Sent request for", self.path, "to", self.host
        self.wflag = 0

    def handle_read (self):
        data = self.recv (8192)
        print "Channel:", self.cnum, "Received", len(data), "bytes"

    def handle_write (self):
        print "Channel:", self.cnum, "was writable"

    def writable (self):
        # Only ask for write events until the request has gone out;
        # handle_connect() clears wflag, so we stop busy-polling after that.
        return self.wflag

import sys
import urlparse
cnum = 0
for url in sys.argv[1:]:
    parts = urlparse.urlparse (url)
    if parts[0] != 'http':
        raise ValueError, "HTTP URLs only, please"
    else:
        cnum += 1
        host = parts[1]
        path = parts[2] or '/'      # an empty path means the root page
        http_client (host, path, cnum)
asyncore.loop()

Now in this particular example each http_client object just prints debug
information about its input. You should easily be able to modify it to do
something sensible with the downloaded HTML, I should think (a rough sketch
follows below). Try running the program with several URLs on the command line
and you'll soon see how it works.
Then read the documentation (such as it is), and download Sam's code for
more information on this fascinating system.
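
For example, a subclass along these lines would collect each page body into
a dictionary, much as your read_pages() does. Again, only a sketch (untested;
the page_collector name and results dictionary are mine, and the header
stripping is deliberately naive):

results = {}

class page_collector (http_client):

    def __init__ (self, host, path, cnum, url):
        http_client.__init__ (self, host, path, cnum)
        self.url = url
        self.buffer = ''

    def handle_read (self):
        # accumulate the response instead of just printing its size
        self.buffer = self.buffer + self.recv (8192)

    def handle_close (self):
        # the server is done: strip the HTTP headers, keep the body
        eoh = self.buffer.find ('\r\n\r\n')
        if eoh >= 0:
            results[self.url] = self.buffer[eoh + 4:]
        else:
            results[self.url] = self.buffer
        self.close ()

Instantiate page_collector instead of http_client in the loop above (passing
the original url as the extra argument), and when asyncore.loop() returns,
results maps each URL to its page body.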

Sam's ideas are not as well-known as they should be, and I'm sure you'll see
a big improvement over just reading the HTML streams one by one if you go
with this code. Since each http_client is an object, it's quite easy to
parse each HTML stream independently of the others. Also, since they are
handled in parallel, you will no longer be so concerned about pages which
take longer than 3 seconds, since they won't hold the others up!

regards
 Steve
--
www.holdenweb.com





