Fetching websites with Python
Andrew Bennetts
andrew-pythonlist at puzzling.org
Thu Apr 1 06:33:19 EST 2004
On Wed, Mar 31, 2004 at 07:33:45PM +0200, Markus Franz wrote:
> Hi.
>
> How can I grab websites with a command-line python script? I want to start
> the script like this:
>
> ./script.py ---xxx--- http://www.address1.com http://www.address2.com
> http://www.address3.com
>
> The script should load these 3 websites (or more if specified) in parallel
> (maybe processes? threads?) and show their contents separated by ---xxx---.
> The whole output should be printed on the command line. Each website should
> only have 15 seconds to return the contents (maximum) in order to avoid a
> never-ending script.
>
> How can I do this?
You could use Twisted <http://twistedmatrix.com>:
from twisted.internet import reactor
from twisted.web.client import getPage
import sys

def gotPage(page):
    print separator
    print page

def failed(failure):
    print separator + ' FAILED'
    failure.printTraceback()

def decrement(ignored):
    global count
    count -= 1
    if count == 0:
        reactor.stop()

separator = sys.argv[1]
urlList = sys.argv[2:]
count = len(urlList)
for url in urlList:
    getPage(url, timeout=15).addCallbacks(gotPage, failed).addBoth(decrement)
reactor.run()
It will grab the sites in parallel, printing them in the order they arrive,
and it doesn't use multiple processes or multiple threads :)
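Since the question did mention threads, here is a rough thread-based sketch with no Twisted dependency, just the standard library. It uses the Python 3 `urllib.request` module (the equivalent at the time of the original post would have been `urllib2`), and the `fetch`/`fetch_all` helper names are mine, not from the post. Unlike the Twisted version, it prints results in argument order rather than arrival order:

```python
import sys
import threading
import urllib.request  # on the Python 2 of the era, urllib2 played this role

def fetch(url, results, idx):
    # Store the page body (or an error marker) in this URL's slot.
    try:
        with urllib.request.urlopen(url, timeout=15) as resp:
            results[idx] = resp.read()
    except Exception as e:
        results[idx] = ('FAILED: %s' % e).encode()

def fetch_all(urls, timeout=15):
    # One worker thread per URL; all fetches run concurrently.
    results = [None] * len(urls)
    threads = [threading.Thread(target=fetch, args=(u, results, i))
               for i, u in enumerate(urls)]
    for t in threads:
        t.daemon = True  # a hung fetch must not keep the script alive
        t.start()
    for t in threads:
        t.join(timeout)  # cap how long we wait for each worker
    return results

if __name__ == '__main__' and len(sys.argv) > 2:
    separator = sys.argv[1]
    for body in fetch_all(sys.argv[2:]):
        print(separator)
        print(body)
```

Note that `t.join(timeout)` only bounds the wait, it can't kill a stuck thread, which is why the workers are daemonized; Twisted's `getPage(url, timeout=15)` handles cancellation more cleanly.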
-Andrew.
More information about the Python-list mailing list