read all available pages on a Website

Alex Martelli aleaxit at yahoo.com
Mon Sep 13 05:09:08 EDT 2004


Leif K-Brooks <eurleif at ecritters.biz> wrote:

> Tim Roberts wrote:
> > Brad Tilley <bradtilley at usa.net> wrote:
> > 
> >>Is there a way to make urllib or urllib2 read all of the pages on a Web
> >>site?
> > By the way, there are many web sites for which this sort of behavior is not
> > welcome.
> 
> Any site that didn't want to be crawled would most likely use a 
> robots.txt file, so you could check that before doing the crawl.

Python's Tools/webchecker/ directory has just the code you need, both
for crawling a site and for checking robots.txt first.  The directory is
part of the Python source distribution, but it's all pure Python code,
so, if your distribution is binary and omits that directory, just
download the Python source distribution, unpack it, and there you are.
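
For a feel of the two pieces involved (honoring robots.txt and following
only same-site links), here is a minimal sketch. It is not webchecker's
code. It assumes a current Python 3 standard library (urllib.robotparser,
urllib.parse, html.parser), and the names LinkCollector and
same_site_links, the example.com URLs, and the inlined robots.txt and
page are all hypothetical, made up for illustration. A real crawl would
fetch both over the network and keep a queue of pages still to visit.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def same_site_links(base_url, html, robots):
    """Return links on base_url's host that robots.txt allows fetching."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlsplit(base_url).netloc
    return [
        link
        for link in collector.links
        if urlsplit(link).netloc == host and robots.can_fetch("*", link)
    ]


# Hypothetical robots.txt and page, inlined so no network access is needed.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

page = (
    '<a href="/docs/">docs</a> '
    '<a href="/private/x.html">private</a> '
    '<a href="http://other.example/">elsewhere</a>'
)
allowed = same_site_links("http://www.example.com/index.html", page, robots)
print(allowed)
```

Against a live site you would call robots.set_url(...) and robots.read()
instead of parse(), and feed each allowed link back into the same
function until no unvisited pages remain.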


Alex


