search an entire website given the homepage URL

Bell, Kevin kevin.bell at slcgov.com
Tue Apr 25 18:08:43 EDT 2006


Fredrik wrote:
to grab entire sites ?

try doing that on a commercial data provider's site, and chances are
that you'll end up being banned (or sued) within hours ...
-------------

Me:

Nope, I never said that to start with...

Well I certainly am learning a lot.  I never said I intended to download
anyone's entire website, as was assumed, but it's been fun to see how
folks feel about it anyway!

I would like some feedback about my actual intention though, which is to
scrape local newspaper websites for the names of people that I work
with.  Twice this month, colleagues have unknowingly been in the
newspaper, and only became aware of it because someone stumbled across
the line in the article.  To write a script that would crawl around
testing for my own name, or that of my colleagues, wouldn't seem uncouth
to me, but I'm new at this stuff.  It seems impolite for newspapers to
use someone's name without informing them of it, for sure, but you can't
count on journalists to call you up.  Would this application of a spider
be impolite?
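
For what it's worth, here is a minimal sketch of the kind of script described above. It assumes you already have a short list of article paths to check, rather than crawling the whole site. The user-agent string, the delay, and the function names are all hypothetical choices for illustration, not anything a particular newspaper requires. It asks robots.txt for permission first, fetches each page at most once, and pauses between requests.

```python
import re
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin

USER_AGENT = "name-watch-sketch/0.1"  # hypothetical identifier, pick your own
DELAY_SECONDS = 5.0                   # assumed pause between requests


def find_names(text, names):
    """Return the names that occur in text as whole words, ignoring case."""
    found = []
    for name in names:
        # Allow any whitespace (including line breaks) between name parts.
        parts = [re.escape(part) for part in name.split()]
        pattern = r"\b" + r"\s+".join(parts) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(name)
    return found


def check_pages(base_url, paths, names):
    """Fetch each allowed path once and map URL -> names found there."""
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(base_url, "/robots.txt"))
    robots.read()

    hits = {}
    for path in paths:
        url = urljoin(base_url, path)
        if not robots.can_fetch(USER_AGENT, url):
            continue  # the site asked robots to stay out of this path
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request, timeout=30) as response:
            text = response.read().decode("utf-8", errors="replace")
        matched = find_names(text, names)
        if matched:
            hits[url] = matched
        time.sleep(DELAY_SECONDS)  # keep the request rate low
    return hits
```

A call might look like `check_pages("http://newspaper.example/", ["/local/"], ["Kevin Bell"])`, where the host and path are placeholders. The matching is done on raw page text, so names split across HTML tags would be missed.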




Bell, Kevin wrote:
>>use a search engine (try the search box in the upper right corner).
> 
>>using a spider to download the entire site just so you can "search
>>through it" is bloody impolite.
> 
> Really?  I'd argue that's impolite only if you're an impolite person 
> with a rude agenda, which is not what I had in mind, but thanks for 
> the ethics lecture as well as the pointer ; )  I assure you that I 
> harbor no nefarious scheme.  Isn't it common for folks to watch the 
> stock market, or real estate listings, for example?
> 
> I'll look into the tools you mentioned, and thanks again!
> 
> 
I think Fredrik's right: the intarweb is supposed to be distributed, not
live on your desktop. Folk who watch the stock market don't download
twenty years' worth of data in one afternoon, they generally subscribe
to real-time feeds that are relatively low volume.
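
As a rough illustration of the low-volume approach, the sketch below scans a single RSS 2.0 document for names instead of fetching pages. It assumes the newspaper publishes such a feed, which may not be the case, and the function name is made up for this example. Fetching the feed is left out; the function takes the XML as a string.

```python
import re
import xml.etree.ElementTree as ET


def feed_items_mentioning(rss_xml, names):
    """Return (title, link) pairs for RSS items that mention any of the names."""
    root = ET.fromstring(rss_xml)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.findtext("description", default="")
        haystack = title + " " + description
        for name in names:
            # Whole-word, case-insensitive match on the literal name.
            if re.search(r"\b" + re.escape(name) + r"\b", haystack, re.IGNORECASE):
                matches.append((title, link))
                break  # one hit per item is enough
    return matches
```

One request for the feed every so often replaces many page fetches, at the cost of only seeing what the feed's titles and summaries contain.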

regards
  Steve





More information about the Python-list mailing list