Why does this fail?

Anand Pillai pythonguy at Hotpop.com
Mon Jan 5 01:09:05 EST 2004


I could not help replying to this thread...

There are already quite a few spider programs written in Python.
I am the author of one of the earlier ones, called HarvestMan.
It is multithreaded and has many features for downloading websites,
checking links etc. You can get it from the HarvestMan homepage at
http://harvestman.freezope.org. HarvestMan is quite comprehensive
and is rather more than a link checker or web crawler. My feeling
is that it is not easy for a Python beginner to understand, though
the program is distributed as source code in true Python tradition.

  If you want something simpler, try spider.py. You can get
information on it from its PyPI page.
 
  My point is that there is nothing to gain from re-inventing
the wheel again and again. Spider programs have already been written
in Python, so you should try to use them rather than writing code
from scratch. If you have new ideas, take the code of HarvestMan
(or spider) and customize it or improve on it. If the changes are to
HarvestMan, I will be happy to merge them back into the code if I
think they improve the program.

  This is the main reason developers release programs as
open source: help the community, and help yourselves. Re-inventing
the wheel is perhaps not the way to go.

best regards

-Anand


"Dave Murray" <dlmurray at micro-net.com> wrote in message news:<vvhkngs7oachad at corp.supernews.com>...
> Thank you all, this is a hell of a newsgroup. The diversity of answers
> helped me with some unasked questions, and provided more elegant solutions
> to what I thought I had figured out on my own. I appreciate it.
> 
> It's part of a spider that I'm working on to verify my own (and friends')
> web pages and check for broken links. Looks like making it follow robot
> rules (robots.txt and meta field exclusions) is what's left.
> 
> I have found the html/sgml parsing library to be not very robust. Big .php
> and .html files with lots of cascades and external references break it very
> ungracefully (sgmllib.SGMLParseError: expected name token). I'd like to be
> able to trap that stuff and just move on to the next file, accepting the
> error. I'm reading in the external links and printing the title as a sanity
> check, in addition to collecting href anchors. The problem that I asked
> about reared its head when I started testing for a robots.txt file, which
> may or may not exist.
> 
> The real point is to learn the language. When a new grad at work wrote a
> useful utility in Python faster than I could have written it in C, I decided
> that I needed to learn Python. He's very sharp, but he sold me on the
> language too. Since I often have to write utilities and normally don't have
> much time to spend on them, Python seems like a very good fit.
> 
> Dave
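
P.S. A couple of quick sketches for the two specific problems quoted
above, in the same spirit of not re-inventing more than you must.

For the robots.txt check, the standard library already ships a
robotparser module that fetches and parses robots.txt and answers the
allow/deny question for you. A rough sketch (the "LinkChecker"
user-agent string is only a placeholder, substitute your spider's own
name):

import robotparser
import urlparse

def allowed(url, agent="LinkChecker"):
    # Build the robots.txt URL for the host serving this page.
    scheme, netloc = urlparse.urlsplit(url)[:2]
    rp = robotparser.RobotFileParser()
    rp.set_url("%s://%s/robots.txt" % (scheme, netloc))
    rp.read()                       # fetch and parse the file
    return rp.can_fetch(agent, url)

print allowed("http://www.python.org/doc/current/lib/")

Note that robotparser only covers robots.txt; the meta tag exclusions
you would still have to pick up while parsing the page itself.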
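
As for trapping the ungraceful sgmllib failures: SGMLParseError is an
ordinary exception, so you can catch it around feed() and simply accept
the error and move on to the next file. Something along these lines (a
throwaway parser of my own for illustration, not code from HarvestMan
or spider.py):

import sgmllib

class LinkCollector(sgmllib.SGMLParser):
    """Collect href anchors and the page title."""

    def reset(self):
        sgmllib.SGMLParser.reset(self)
        self.links = []
        self.title = ""
        self.in_title = 0

    def start_a(self, attrs):
        # attrs is a list of (name, value) pairs, names lowercased.
        for name, value in attrs:
            if name == "href":
                self.links.append(value)

    def start_title(self, attrs):
        self.in_title = 1

    def end_title(self):
        self.in_title = 0

    def handle_data(self, data):
        if self.in_title:
            self.title = self.title + data

def parse_page(data):
    # Returns (title, links), or None if the parser chokes on the page.
    parser = LinkCollector()
    try:
        parser.feed(data)
        parser.close()
    except sgmllib.SGMLParseError:
        return None         # accept the error, move on to the next file
    return parser.title, parser.links

If parse_page() returns None, just log the URL and carry on with the
rest of your list instead of letting the whole run die.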


