A problem while using urllib

Johnny Lee johnnyandfiona at hotmail.com
Tue Oct 11 05:57:22 EDT 2005


Hi,
   I am using urllib to grab URLs from the web. Here is the workflow of
my program:

1. Get the base URL and the maximum number of URLs from the user
2. Call the filter to validate the base URL
3. Read the source of the base URL and grab all the URLs from the "href"
attribute of the "a" tags (a rough sketch of this step is below)
4. Call the filter to validate every URL grabbed
5. Repeat steps 3-4 until the number of URLs grabbed reaches the limit
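
   To give an idea of what step 3 looks like, here is a rough, simplified
sketch using urllib2 and the standard HTMLParser module (the class and
function names are only for illustration, not the actual code of my
program):

--------------------------------------------------
import urllib2
from HTMLParser import HTMLParser

# collect the href attribute of every "a" tag
class LinkGrabber(HTMLParser):
   def __init__(self):
      HTMLParser.__init__(self)
      self.links = []

   def handle_starttag(self, tag, attrs):
      if tag == "a":
         for name, value in attrs:
            if name == "href" and value:
               self.links.append(value)

# read the source of a page and return all the urls found in "a" tags
def grabUrls(url):
   webPage = urllib2.urlopen(url)
   try:
      source = webPage.read()
   finally:
      webPage.close()
   parser = LinkGrabber()
   parser.feed(source)
   return parser.links
----------------------------------------------------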

   In the filter there is a method like this:

--------------------------------------------------
import urllib2

# check whether the url can be connected
def filteredByConnection(self, url):
   assert url

   try:
      webPage = urllib2.urlopen(url)
   except urllib2.HTTPError:
      # HTTPError is a subclass of URLError, so it must be caught first
      self.logGenerator.log("Error: " + url + " not found")
      return False
   except urllib2.URLError:
      self.logGenerator.log("Error: " + url + " <urlopen error timed out>")
      return False
   self.logGenerator.log("Connecting " + url + " succeeded")
   webPage.close()
   return True
----------------------------------------------------

   But every time, after 70 to 75 URLs have been tested this way, the
program crashes, and every URL left raises urllib2.URLError until the
program exits. I have tried many ways to work around it, such as
switching to urllib and putting a sleep(1) in the filter (I thought the
sheer number of URLs was crashing the program), but none of them works.
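
   For example, the urllib attempt looked roughly like this (again a
simplified sketch with illustrative names, not the exact code; the real
check is a method on a class):

--------------------------------------------------
import time
import urllib

# variant of the connection check using urllib instead of urllib2,
# with a one-second pause before every request
def filteredByConnection(url):
   time.sleep(1)
   try:
      webPage = urllib.urlopen(url)
   except IOError:
      print "Error: could not connect to " + url
      return False
   webPage.close()
   print "Connecting " + url + " succeeded"
   return True
----------------------------------------------------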
BTW, if I set the URL at which the program crashed as the base URL, the
program still crashes at around the 70th-75th URL. How can I solve this
problem? Thanks for your help.

Regards,
Johnny



