read all available pages on a Website

Carlos Ribeiro carribeiro at gmail.com
Mon Sep 13 09:24:11 EDT 2004


Brad,

Just to clarify something other posters have said. Automatic crawling
of websites is not welcome primarily because of performance concerns.
It may also be regarded by some webmasters as a kind of abuse, because
the crawler is generating hits and copying material for unknown reasons,
but is not seeing any ads or generating any revenue. Some sites go so
far as to block access from your IP, or even from your entire IP range,
when they detect this type of behavior. Because of this, there is a
very simple protocol involving a file called "robots.txt". Whenever
your robot first enters a site, it must fetch this file and follow the
instructions in it; the file tells you what you are allowed to do on
that website.
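
As a rough sketch, Python's standard library can do the robots.txt
check for you. Something along these lines should work (in current
Python the parser lives in urllib.robotparser; the URL and the
user-agent name below are just made-up examples):

    # Minimal sketch: consult robots.txt before fetching a page.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    user_agent = "MyCrawler"  # hypothetical user-agent name
    url = "http://www.example.com/some/page.html"

    if rp.can_fetch(user_agent, url):
        print("allowed to fetch", url)
    else:
        print("robots.txt disallows", url)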

There are also a few other catches that you need to be aware of.
First, some sites don't have links pointing to all of their pages, so
you can never be completely sure that you have read *all* of them.
Also, some sites have links embedded in scripts. That's not a
recommended practice, but it's common on some sites, and it may cause
you problems. Finally, your robot may get stuck on an "infinite site":
some sites generate pages dynamically, and your robot may end up
fetching page after page without ever getting out. So, if you want a
generic solution that can crawl any site you desire, you have to
handle these issues; see the sketch below for one way to bound the
crawl.
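
To avoid the "infinite site" trap, the usual trick is to keep a set of
already-visited URLs and put a hard cap on how many pages you fetch.
A rough sketch follows; the page limit, the start URL, and the little
link-extraction class are only illustrative assumptions, not a
complete crawler:

    # Minimal sketch of a bounded crawl: a visited set plus a hard
    # page limit keeps the robot from looping forever on dynamically
    # generated sites.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag
    from urllib.request import urlopen

    MAX_PAGES = 200  # hard cap so an "infinite site" cannot trap us

    class LinkParser(HTMLParser):
        """Collect the href targets of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url):
        seen = set()
        queue = deque([start_url])
        while queue and len(seen) < MAX_PAGES:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except Exception:
                continue  # skip pages that fail to download or decode
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                # Resolve relative links and drop #fragments.
                absolute, _ = urldefrag(urljoin(url, link))
                if absolute.startswith(start_url) and absolute not in seen:
                    queue.append(absolute)
        return seen

    # pages = crawl("http://www.example.com/")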


Best regards,


-- 
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: carribeiro at gmail.com
mail: carribeiro at yahoo.com


