Web robot "freeze" solved, perhaps (was RE: Debugging on windows via print statements -- reliable?)

Nick Arnett narnett at mccmedia.com
Mon Mar 25 11:07:03 EST 2002


> -----Original Message-----
> From: python-list-admin at python.org
> [mailto:python-list-admin at python.org]On Behalf Of Aahz
> Sent: Monday, March 25, 2002 7:06 AM
> To: python-list at python.org
> Subject: Re: Debugging on windows via print statements -- reliable?

[snip]

> Ah.  You're probably getting stuck in urllib.  You need to use
> http://www.timo-tasi.org/python/timeoutsocket.py

I don't think that was the problem (urlretrieve showed me that the file had
been retrieved before the "freeze"), but timeouts are something I absolutely
wanted to add soon anyway, so a big thank-you.  Tim Bray, who wrote one of the
first Web robots ever, had recently emphasized the importance of "kind"
time-outs for robots on the robots mailing list, so this was very much on my
mind.

What appears to have finally solved this problem was a fix to a function I
was calling to parse each new document.

The function is below.  I had earlier assigned self.onePage to the page
contents in the calling function, then decided to pass the page to this
function as an argument instead, but never updated newParser (no longer an
appropriate name, come to think of it) to match: it still reads self.onePage
and ignores its argument.

    def newParser(self, onePage):
        self.myParser.reset()
        # Bug: these lines ignore the onePage argument and operate on
        # self.onePage, which still holds whatever was assigned last time.
        self.onePage = string.replace(self.onePage, ' \n', ' ')
        self.onePage = string.replace(self.onePage, '\n', ' ')
        self.myParser.feed(self.onePage)
        self.myParser.close()

An extremely quick fix was to add "self.onePage = onePage" at the top; the
proper cleanup, of course, is to get rid of the self references to onePage
entirely.
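
Roughly, the cleaned-up version looks like this (my reconstruction of the
shape of the fix; same logic, just operating on the argument directly):

    def newParser(self, onePage):
        self.myParser.reset()
        # Work on the argument itself -- no instance state involved.
        onePage = string.replace(onePage, ' \n', ' ')
        onePage = string.replace(onePage, '\n', ' ')
        self.myParser.feed(onePage)
        self.myParser.close()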

I suppose it's obvious that I was still assigning self.onePage in the calling
function, or this wouldn't have worked at all.  I haven't quite sorted out
exactly what was happening or why it misbehaved only intermittently, but I
have now run a few hundred iterations with no problems.
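
My best guess at the failure mode, sketched with hypothetical names (robot,
pageA, pageB are made up for illustration): any call path that set
self.onePage first looked fine, while a path that skipped the assignment
silently re-parsed the previous page.

    # Hypothetical sketch -- not my actual calling code.
    robot.onePage = pageA
    robot.newParser(pageA)   # parses pageA; looks correct
    robot.newParser(pageB)   # old bug: re-parses pageA (stale self.onePage)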

I added timeoutsocket, but no timeouts have happened yet.  Then again, I'm
hitting a very robust site and my ISP is incredibly reliable; most pages are
retrieved in about 0.2 seconds, rarely more than a second.
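
For anyone else wiring it in: if I'm reading the module right (worth
double-checking against the copy you download), it patches the standard
socket module when imported, so the hookup is just a couple of lines before
you touch urllib:

    # Import timeoutsocket first so it can wrap the standard socket module.
    import timeoutsocket
    timeoutsocket.setDefaultSocketTimeout(20)   # seconds; pick a "kind" value

    import urllib
    # urllib.urlretrieve() and friends will now raise a timeout error
    # instead of hanging forever on a dead server.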

Whew.

Thanks, all!

Nick




