[Tutor] Threads

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Tue Nov 16 02:23:55 CET 2004



On Mon, 15 Nov 2004, Terry Carroll wrote:

> On Mon, 15 Nov 2004, orbitz wrote:
>
> > I guess you didn't read what I said. I suggested using
> > async/non-blocking sockets so you may have multiple downloads going.
> > Feel free to google what they are.
>
> I read it; I misunderstood it.


Hi everyone,


[text cut]

I have to admit: I haven't done too much asynchronous stuff myself yet.
A learning experience!  *grin*


Let's look at an example of doing something like asynchronous http
downloads, using the select.select() call.  It sounds like the problem is
to retrieve a bunch of pages by url simultaneously.  We try doing things
in parallel to improve network throughput, and to account for certain web
pages coming in more slowly than others.


We can use urllib.urlopen() to grab a bunch of web pages.  Since the
object it returns looks just like a file, we can use it as part of a
'select' event loop.  For example:

###
>>> import urllib
>>> import select
>>> f = urllib.urlopen("http://python.org")
>>> ready_in, ready_out, ready_exc = select.select([f], [], [])
###

When we call select.select(), what comes back are the file objects that
are ready to be read().  select.select() is useful because it hands back
only the files that currently have content waiting for us to read.
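
For example, here's a rough sketch (untested, just to show the shape of
the loop) that pulls down two pages at once.  Note that select.select()
also takes an optional timeout argument, so we don't have to block
forever waiting for data:

###
import select
import urllib

pages = [urllib.urlopen("http://python.org"),
         urllib.urlopen("http://www.pythonware.com/daily/")]
while pages:
    ## Wait up to five seconds for any of the streams to have data.
    ins, outs, errs = select.select(pages, [], [], 5.0)
    for f in ins:
        chunk = f.read(1024)
        if chunk:
            pass      ## ... do something with the chunk here ...
        else:
            ## An empty read means this page is finished.
            f.close()
            pages.remove(f)
###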




Here's some demonstration code that uses select together with the
file-like objects that come back from urlopen():


######
"""A small demonstration on grabbing pages asynchronously.

Danny Yoo (dyoo at hkn.eecs.berkeley.edu)

urllib.urlopen() provides a file-like object, but we can still get at
the underlying socket.

"""

import select
import sys
import urllib


class PageGrabber:
    def __init__(self):
        self._urls = {}
        self._inFiles = {}
        self._outFiles = {}


    def add(self, url, outputFile):
        """Adds a new url to be grabbed.  We start writing the output
        to the outputFile."""
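        ## Note: urlopen() itself blocks while it connects and sends the
        ## request; only the later reads get multiplexed by select().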
        openedFile = urllib.urlopen(url)
        fileno = openedFile.fileno()
        self._inFiles[fileno] = openedFile
        self._urls[fileno] = url
        self._outFiles[fileno] = outputFile


    def writeOutAllPages(self):
        """Waits until all the url streams are written out to their
        respective outFiles."""
        while self._urls:
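            ## Block until at least one of the open streams has data
            ## ready to be read.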
            ins, outs, errs = select.select(self._inFiles.keys(), [], [])
            for in_fileno in ins:
                all_done = self._writeBlock(in_fileno)
                if all_done:
                    self._dropUrl(in_fileno)


    def _dropUrl(self, in_fileno):
        del self._urls[in_fileno]
        self._inFiles[in_fileno].close()
        del self._inFiles[in_fileno]
        del self._outFiles[in_fileno]



    def _writeBlock(self, in_fileno, block_size=1024):
        """Write out the next block.  If no more blocks are available,
        returns True.  Else, returns false."""
        next_block = self._inFiles[in_fileno].read(block_size)
        self._outFiles[in_fileno].write(next_block)
        if next_block:
            return False
        else:
            return True



######################################################################
## The rest here is just test code.  I really should be using unittest
## but I got impatient.  *grin*

class TracedFile:
    """A small file just to trace when things are getting written.
    Used just for debugging purposes"""
    def __init__(self, name, file):
        self.name = name
        self.file = file

    def write(self, bytes):
        sys.stderr.write("%s is writing.\n" % self.name)
        self.file.write(bytes)


if __name__ == '__main__':
    p = PageGrabber()
    p.add("http://python.org",
          TracedFile("python.org", sys.stdout))
    p.add("http://www.pythonware.com/daily/",
          TracedFile("daily python url", sys.stdout))
    p.writeOutAllPages()
######
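
By the way, if you'd rather collect each page into a string instead of
streaming it to stdout, one possible approach (just a sketch, using the
standard StringIO module) is to hand add() a StringIO buffer for each
url and pull the text out afterwards:

###
from StringIO import StringIO

buffers = {}
p = PageGrabber()
for url in ("http://python.org", "http://www.pythonware.com/daily/"):
    buffers[url] = StringIO()
    p.add(url, buffers[url])
p.writeOutAllPages()

## At this point each buffer holds the full text of its page.
for url in buffers:
    print url, "->", len(buffers[url].getvalue()), "bytes"
###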


The code is really rough and not factored well yet, but it's a starting
point.  I hope this helps!


