read all available pages on a Website

Mon Sep 13 05:09:08 EDT 2004

Leif K-Brooks <eurleif at ecritters.biz> wrote:

> Tim Roberts wrote:
> > Brad Tilley <bradtilley at usa.net> wrote:
> > 
> >>Is there a way to make urllib or urllib2 read all of the pages on a Web
> >>site?
> > By the way, there are many web sites for which this sort of behavior is not
> > welcome.
> 
> Any site that didn't want to be crawled would most likely use a 
> robots.txt file, so you could check that before doing the crawl.

Python's Tools/webchecker/ directory has just the code you need for both
following a site's links and checking robots.txt.  The directory is part
of the Python source distribution, but it's all pure Python code, so if
your distribution is binary and omits that directory, just download the
Python source distribution, unpack it, and there you are.
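
For a rough idea of what such a crawl involves, here is a minimal sketch
using only the standard library of a current Python 3, where the
robots.txt parser lives in urllib.robotparser.  It is not webchecker's
code.  The names LinkCollector, fetch and crawl, the page limit, and the
choice to pass robots.txt in as a list of lines are all assumptions made
for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.robotparser import RobotFileParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Hypothetical helper: download a URL and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, robots_lines, fetch_page=fetch, agent="*", limit=50):
    """Breadth-first walk of same-host pages that robots.txt allows.

    robots_lines is the site's robots.txt content, already split into
    lines (an assumption of this sketch; it keeps the function testable
    without network access).  Returns a dict mapping URL to page text.
    """
    rules = RobotFileParser()
    rules.parse(robots_lines)
    host = urlparse(start_url).netloc
    queue, seen, pages = [start_url], {start_url}, {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if not rules.can_fetch(agent, url):
            continue  # the site asked crawlers to stay out of this path
        html = fetch_page(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment part.
            target, _fragment = urldefrag(urljoin(url, href))
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Against a real site you would first download its /robots.txt, for
example with fetch(...).splitlines(), and pass the result as
robots_lines.  A polite crawler would also pause between requests and
skip non-HTML responses, which this sketch leaves out.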

Alex