read all available pages on a Website

Tim Roberts timr at probo.com
Mon Sep 13 02:13:49 EDT 2004


Brad Tilley <bradtilley at usa.net> wrote:

>Is there a way to make urllib or urllib2 read all of the pages on a Web 
>site? For example, say I wanted to read each page of www.python.org into 
>separate strings (a string for each page). The problem is that I don't 
>know how many pages are at www.python.org. How can I handle this?

You have to parse the HTML to pull out all the links and images and fetch
them, one by one.  sgmllib can help with the parsing.  You can multithread
this, if performance is an issue.
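
As a rough illustration of that parse-and-fetch loop, here is a minimal
single-threaded sketch. It is written for Python 3, where sgmllib no longer
exists, so it swaps in the standard html.parser module instead. The names
LinkParser, fetch, crawl and the max_pages limit are made up for this
example, and staying on the starting host is an assumption about what
"all of the pages on a Web site" should mean.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the targets of <a href=...> and <img src=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # absolute URLs of linked pages
        self.images = []  # absolute URLs of images

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links and drop any #fragment.
            url, _ = urldefrag(urljoin(self.base_url, attrs["href"]))
            self.links.append(url)
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def fetch(url):
    """Download one URL and return its body as text (UTF-8 assumed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch, max_pages=50):
    """Return {url: html} for pages reachable from start_url on the same host.

    max_pages is a hypothetical safety limit so the loop always ends.
    """
    host = urlparse(start_url).netloc
    pages = {}
    seen = {start_url}
    queue = [start_url]
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser(url)
        parser.feed(html)
        for link in parser.links:
            # Only follow links on the starting host, each one once.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("http://www.python.org/") would give one string per page,
keyed by URL. Image URLs end up in parser.images but are not downloaded
here, and the fetch argument is only there so the download step can be
replaced, for example with a threaded or cached version.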

By the way, there are many web sites for which this sort of behavior is not
welcome.
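
One common way for a site to say what it does not want fetched is a
robots.txt file. Assuming that convention is what applies here, a crawler
could check it with the standard urllib.robotparser module before each
request. This is only a sketch; the helper names robots_for_site and
robots_from_text and the user-agent string are hypothetical.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_for_site(site_url):
    """Download and parse /robots.txt for the given site."""
    rp = RobotFileParser(urljoin(site_url, "/robots.txt"))
    rp.read()
    return rp


def robots_from_text(text):
    """Parse robots.txt rules that are already in memory."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


# Example rules, parsed locally so nothing is fetched over the network.
rules = robots_from_text("User-agent: *\nDisallow: /private/\n")
# "my-crawler" is a made-up user-agent name.
ok = rules.can_fetch("my-crawler", "http://example.test/index.html")
blocked = not rules.can_fetch("my-crawler", "http://example.test/private/a.html")
```

A site without a robots.txt file may still not welcome bulk fetching, so
this check is a minimum courtesy rather than permission.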
-- 
- Tim Roberts, timr at probo.com
  Providenza & Boekelheide, Inc.


