Fetching websites with Python

Andrew Bennetts andrew-pythonlist at puzzling.org
Thu Apr 1 06:33:19 EST 2004


On Wed, Mar 31, 2004 at 07:33:45PM +0200, Markus Franz wrote:
> Hi.
> 
> How can I grab websites with a command-line python script? I want to start
> the script like this:
> 
> ./script.py ---xxx--- http://www.address1.com http://www.address2.com
> http://www.address3.com
> 
> The script should load these 3 websites (or more if specified) in parallel
> (maybe with processes? threads?) and show their contents separated by ---xxx---.
> The whole output should be printed on the command line. Each website should
> have at most 15 seconds to return its contents, in order to avoid a
> never-ending script.
> 
> How can I do this?

You could use Twisted <http://twistedmatrix.com>:

    from twisted.internet import reactor
    from twisted.web.client import getPage
    import sys
    
    def gotPage(page):
        # Print the separator, then the page body, as each fetch completes.
        print separator
        print page
    
    def failed(failure):
        print separator + ' FAILED'
        failure.printTraceback()
    
    def decrement(ignored):
        # Stop the reactor once every request has finished or failed.
        global count
        count -= 1
        if count == 0:
            reactor.stop()
    
    separator = sys.argv[1]
    urlList = sys.argv[2:]
    count = len(urlList)
    for url in urlList:
        # getPage returns a Deferred; timeout=15 aborts fetches that take
        # longer than 15 seconds, triggering the 'failed' errback.
        getPage(url, timeout=15).addCallbacks(gotPage, failed).addBoth(decrement)
    
    reactor.run()

It grabs the sites in parallel, printing each one as it arrives, and uses
neither multiple processes nor multiple threads :)
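For comparison, here is a rough sketch of the same behaviour using only the
standard library, with one thread per URL (written in present-day Python;
the function names `fetch` and `grab_all` are my own, not from the original
post). Note one design difference: unlike the Twisted version, this prints
results in argument order after all threads finish, not in arrival order.

```python
import sys
import threading
import urllib.request

def fetch(url, results, index):
    # Fetch one URL with a 15-second timeout; store the page body
    # (or an error marker) in the shared results list.
    try:
        with urllib.request.urlopen(url, timeout=15) as resp:
            results[index] = resp.read().decode('utf-8', 'replace')
    except Exception as exc:
        results[index] = 'FAILED: %s' % exc

def grab_all(separator, urls):
    # One thread per URL, so slow sites don't delay the others.
    results = [None] * len(urls)
    threads = [threading.Thread(target=fetch, args=(url, results, i))
               for i, url in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for body in results:
        print(separator)
        print(body)

if __name__ == '__main__' and len(sys.argv) > 2:
    grab_all(sys.argv[1], sys.argv[2:])
```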

-Andrew.

More information about the Python-list mailing list