search an entire website given the homepage URL

Terry Reedy tjreedy at udel.edu
Tue Apr 25 19:55:08 EDT 2006


"Bell, Kevin" <kevin.bell at slcgov.com> wrote in message 
news:2387F0EED10A4545A840B231BBAAC7226082CC at slcimail1.slcgov.com...
> I would like some feedback about my actual intention though, which is to
> scrape local newspaper websites for the names of people that I work
> with.  Twice this month, colleagues have unknowingly been in the
> newspaper, and only became aware of it because someone stumbled across
> the line in the article.  To write a script that would crawl around
> testing for my own name, or that of my colleagues, wouldn't seem uncouth
> to me, but I'm new at this stuff.  It seems impolite for newspapers to
> use someone's name without informing them of it, for sure, but you can't
> count on journalists to call you up.  Would this application of a spider
> be impolite?

If the site has an index, I would use that.
If the site has pages at fixed URLs accessible to public indexes (Google,
Yahoo, etc.), I would use one of those.  (Google, at least, will restrict a
search to a specific site.)
If the site has a robots.txt file requesting robots and spiders to restrict
themselves, I would honor the request.
Failing the above, I might write something that, once a day during off
hours, downloads and examines articles in the appropriate category.
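
As a rough illustration of the last two points, here is a minimal sketch, not
a finished spider. It assumes Python 3 and only the standard library. The
robots.txt content, the user-agent string "name-watcher", the example.com
URLs, and the helper names are all hypothetical, made up for the example.

```python
import re
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "name-watcher"  # hypothetical user-agent string


def make_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def fetch_if_allowed(rules, url):
    """Download url as text, or return None if robots.txt disallows it."""
    if not rules.can_fetch(USER_AGENT, url):
        return None
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def names_found(article_text, names):
    """Return the names that occur in article_text as whole words."""
    found = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, article_text, flags=re.IGNORECASE):
            found.append(name)
    return found


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /archive/
"""

rules = make_robot_rules(SAMPLE_ROBOTS)
story_ok = rules.can_fetch(USER_AGENT, "http://news.example.com/local/story.html")
archive_ok = rules.can_fetch(USER_AGENT, "http://news.example.com/archive/old.html")
hits = names_found("The mayor thanked Kevin Bell on Monday.",
                   ["Kevin Bell", "Terry Reedy"])
```

In real use one would load the site's own robots.txt (RobotFileParser has
set_url() and read() for that), call fetch_if_allowed() on each article URL in
the category of interest, and run the whole script once a day from a scheduler
such as cron. The scheduling is left out here and is an assumption about how
one might arrange the off-hours run.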

tjr





