Making HTTP requests using Twisted

Manlio Perillo manlio_perilloNO at SPAMlibero.it
Tue Jul 11 13:53:20 EDT 2006


rzimerman wrote:
> I'm hoping to write a program that will read any number of urls from
> stdin (1 per line), download them, and process them. So far my script
> (below) works well for small numbers of urls. However, it does not
> scale to more than 200 urls or so, because it issues HTTP requests for
> all of the urls simultaneously, and terminates after 25 seconds.
> Ideally, I'd like this script to download at most 50 pages in parallel,
> and to time out if and only if any HTTP request is not answered in 3
> seconds. What changes do I need to make?
> 

Take a look at
http://svn.twistedmatrix.com/cvs/trunk/doc/core/examples/stdiodemo.py?view=markup&rev=15456

And read
http://twistedmatrix.com/documents/current/api/twisted.web.client.HTTPClientFactory.html

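You can pass a timeout to the constructor. getPage forwards its keyword
arguments to HTTPClientFactory, so getPage(url, timeout=3) gives each
request a 3 second timeout.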

To download at most 50 pages in parallel you can use a download queue.

Here is a quick example, ABSOLUTELY NOT TESTED:

from twisted.internet.defer import Deferred, DeferredList
from twisted.web.client import getPage


class DownloadQueue(object):
    SIZE = 50  # maximum number of parallel downloads

    def __init__(self):
        self.requests = []   # queued requests: (url, timeout, deferred)
        self.deferreds = []  # requests of the current batch

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # the current batch is full: queue the request
            deferred = Deferred()
            self.requests.append((url, timeout, deferred))

            return deferred
        else:
            # execute the request now
            return self._start(url, timeout)

    def _start(self, url, timeout):
        deferred = getPage(url, timeout=timeout)
        self.deferreds.append(deferred)

        if len(self.deferreds) == self.SIZE:
            # the batch just became full: wait for completion of all
            # its requests before starting the queued ones
            DeferredList(self.deferreds
                         ).addCallback(self._callback)

        return deferred

    def _callback(self, results):
        # the whole batch has completed: take the next batch off the queue
        self.deferreds = []
        queue = self.requests[:self.SIZE]
        self.requests = self.requests[self.SIZE:]

        # execute the requests
        for (url, timeout, deferredHelper) in queue:
            deferred = self._start(url, timeout)

            # forward the result to the Deferred handed out by addRequest
            deferred.chainDeferred(deferredHelper)




Regards  Manlio Perillo