Web Crawling/Threading and Things That Go Bump in the Night

sp1d3rx at gmail.com
Fri Aug 4 14:12:37 EDT 2006


Rem, what OS are you trying this on? Windows XP SP2 has a limit of
around 40 TCP connections per second...
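Whatever the exact per-OS cap turns out to be, one way to stay under it is to throttle how many fetches are in flight with a semaphore. A minimal sketch, not taken from Spider.py -- the limit of 8 and the stubbed fetch() are illustrative:

```python
import threading

MAX_CONNECTIONS = 8  # assumption: keep well under any per-OS socket cap

conn_slots = threading.BoundedSemaphore(MAX_CONNECTIONS)
results = []
results_lock = threading.Lock()

def fetch(url):
    """Stand-in for a real HTTP fetch; returns fake HTML."""
    return "<html>%s</html>" % url

def worker(url):
    with conn_slots:              # at most MAX_CONNECTIONS fetches at once
        html = fetch(url)
    with results_lock:            # results list is shared between threads
        results.append((url, html))

threads = [threading.Thread(target=worker, args=("http://example.com/%d" % i,))
           for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # -> 50
```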

Remarkable wrote:
> Hello all
>
> I am trying to write a reliable web-crawler. I tried to write my own
> using recursion and found I quickly hit the "too many open sockets"
> problem. So I looked for a threaded version that I could easily extend.
>
> The simplest/most reliable I found was called Spider.py (see attached).
>
> At this stage I want a spider that I can point at a site, let it do
> its thing, and reliably get a callback of sorts... including the HTML
> (for me to parse), the URL of the page in question (so I can log it),
> and the URLs found on that page (so I can strip out any I really
> don't want and add them to the "seen" list).
>
>
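For the callback shape described here -- handle(url, html, links), with the handler filtering which links get followed -- a single-threaded sketch against a fake in-memory site. PAGES, LINK_RE and the handle() signature are all made up for illustration, not Spider.py's API:

```python
import re
from collections import deque

# Hypothetical in-memory "site" standing in for real HTTP fetches.
PAGES = {
    "/":  '<a href="/a">a</a> <a href="/b">b</a>',
    "/a": '<a href="/b">b</a>',
    "/b": '<a href="/">home</a>',
}

LINK_RE = re.compile(r'href="([^"]+)"')

def crawl(start, handle):
    """Breadth-first crawl; calls handle(url, html, links) once per page.

    handle() returns the subset of links to follow, so the caller can
    veto URLs and maintain its own seen-list policy.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = PAGES.get(url, "")
        links = LINK_RE.findall(html)
        for link in handle(url, html, links):
            if link not in seen:
                seen.add(link)
                queue.append(link)

visited = []

def handle(url, html, links):
    visited.append(url)   # log the page
    return links          # follow everything in this demo

crawl("/", handle)
print(visited)  # -> ['/', '/a', '/b']
```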
> Now, this is my question.
>
> The code above ALMOST works fine. The crawler crawls and I get the data I
> need, BUT... every now and again the code just pauses; I hit Ctrl-C
> and it reports an error as if it has hit an exception, and then carries
> on!!! I like the fact that my spider_usage.py file has the minimum
> amount of spider stuff in it... really just a main() and a handle()
> handler.
>
> How does this happen... is a thread being killed and then a new one
> made, or what? I suspect it may have something to do with sockets timing
> out, but I have no idea...
>
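On the stall itself: a blocking read on a dead connection hangs its thread forever by default, which matches the pause-then-Ctrl-C-then-carry-on symptom. One common fix is a global socket timeout, so a silent server raises socket.timeout instead of blocking. A sketch -- the 10-second figure is arbitrary, and urllib.request is the current spelling of the urllib2 of this thread's era:

```python
import socket
import urllib.request  # urllib2 at the time of this thread

# Every socket created after this call gets a 10 s connect/read timeout,
# so a dead connection raises socket.timeout instead of hanging a thread.
socket.setdefaulttimeout(10.0)

def fetch(url):
    """Fetch a page, treating a timeout as a recoverable miss."""
    try:
        return urllib.request.urlopen(url).read()
    except socket.timeout:
        return None  # caller can log the URL and move on, or retry
```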
> By the way, on small sites (hundreds of pages) it never stalls;
> it's on larger sites such as Amazon that it "fails".
>
> This is my other question
>
> It would be great to know, when the code is stalled, whether it is doing
> anything... is there any way to even print a full stop to the screen?
>
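Two complementary tricks for exactly this: a daemon heartbeat thread that prints a dot every second while the crawler runs, and a dump of every thread's current stack via sys._current_frames() (available from Python 2.5) to see *where* it is stuck. Both are sketches, not part of Spider.py:

```python
import sys
import threading
import time
import traceback

def heartbeat(stop, interval=1.0):
    """Print a dot every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        sys.stdout.write(".")
        sys.stdout.flush()
        stop.wait(interval)

def dump_threads():
    """Print the current stack of every live thread."""
    for thread_id, frame in sys._current_frames().items():
        print("\n--- thread %d ---" % thread_id)
        traceback.print_stack(frame)

# Usage: start the heartbeat before crawling...
stop = threading.Event()
t = threading.Thread(target=heartbeat, args=(stop,))
t.daemon = True       # don't keep the process alive on exit
t.start()
time.sleep(0.1)       # stand-in for the crawl
stop.set()
# ...and call dump_threads() whenever things look stuck.
```

Wiring dump_threads() to a signal handler (e.g. SIGQUIT) would let you inspect a stalled crawler without killing it.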
> This is my last question
>
> Given Python's suitability for this sort of thing (isn't Google written
> in it?), I can't believe that there isn't a kick-ass crawler
> already out there...
>
> regards
>
> tom
>
> http://www.theotherblog.com/Articles/2006/08/04/python-web-crawler-spider/




More information about the Python-list mailing list