[Spambayes-checkins] spambayes/testtools urlslurper.py,1.2,1.3

Mark Hammond mhammond at users.sourceforge.net
Thu May 1 17:09:23 EDT 2003


Update of /cvsroot/spambayes/spambayes/testtools
In directory sc8-pr-cvs1:/tmp/cvs-serv2431

Modified Files:
	urlslurper.py 
Log Message:
I may have broken a global in my last checkin - sorry about that.  Cache works again.

Only slurp HTML content - avoid gifs etc

Index: urlslurper.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/testtools/urlslurper.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** urlslurper.py	1 May 2003 11:46:51 -0000	1.2
--- urlslurper.py	1 May 2003 23:09:20 -0000	1.3
***************
*** 116,119 ****
--- 116,123 ----
  url_dict = {}
  
+ # Exception to raise when we should skip the URL
+ class IgnoreURLException(Exception):
+     pass
+ 
  def body_tokens(msg):
      tokens = Tokenizer().tokenize_body(msg)
***************
*** 164,167 ****
--- 168,175 ----
                      try:
                          f = urllib2.urlopen(url)
+                         # Anything that isn't text/html is ignored
+                         content_type = f.headers.get('content-type')
+                         if not content_type.startswith("text/html"):
+                             raise IgnoreURLException("content type='%s'" % (content_type,))
                          page = f.read()
                          f.close()
***************
*** 170,176 ****
                          if options["globals", "verbose"]:
                              print >> sys.stderr, "Slurped."
!                     except (IOError, socket.error):
                          url_dict[url] = 0.5
!                         print >> sys.stderr, "Couldn't get", url
                      if not url_dict.has_key(url) or url_dict[url] != 0.5:
                          # Create a fake Message object since Tokenizer is
--- 178,184 ----
                          if options["globals", "verbose"]:
                              print >> sys.stderr, "Slurped."
!                     except (IgnoreURLException, IOError, socket.error), details:
                          url_dict[url] = 0.5
!                         print >> sys.stderr, "Couldn't get %s (%s)" % (url, details)
                      if not url_dict.has_key(url) or url_dict[url] != 0.5:
                          # Create a fake Message object since Tokenizer is
***************
*** 222,225 ****
--- 230,234 ----
      if os.path.exists(filename):
          f = file(filename, "r")
+         global url_dict
          url_dict = pickle.load(f)
          f.close()
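
The content-type filter added in the second hunk can be tried in isolation.
What follows is only a sketch in Python 3 (the commit itself is Python 2 and
uses urllib2), not the checked-in code. The names check_html, slurp and
opener are hypothetical and exist only for this illustration; opener is
assumed to be any callable that returns a urlopen-style response. The
handling of a missing header and of mixed-case values is an addition that
the commit does not make.

```python
class IgnoreURLException(Exception):
    """Raised when a URL should be skipped rather than slurped."""


def check_html(content_type):
    """Return the header value if it names text/html, otherwise raise.

    Hypothetical helper: the commit performs this test inline.
    """
    # A missing header arrives as None; this sketch treats it as "not HTML"
    # instead of letting .startswith fail on None.
    if content_type is None or not content_type.lower().startswith("text/html"):
        raise IgnoreURLException("content type=%r" % (content_type,))
    return content_type


def slurp(url, opener):
    """Fetch url through opener and return the body only for HTML pages.

    opener is an assumption of this sketch: any callable that takes a URL
    and returns an object with .headers.get(), .read() and .close().
    """
    f = opener(url)
    try:
        check_html(f.headers.get("content-type"))
        return f.read()
    finally:
        # Close the response whether or not the page was accepted.
        f.close()
```

With the standard library, slurp(url, urllib.request.urlopen) would be one
way to call it; a caller would catch IgnoreURLException alongside the
network errors, as the except clause in the diff does.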

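The last hunk adds "global url_dict" before the cache is loaded. The log
message does not spell out the failure, but the likely cause is that an
assignment inside a function binds a local name unless the name is declared
global, so the module-level cache stayed empty. A minimal Python 3 sketch of
that behaviour follows; the loader function names and the in-memory pickle
are hypothetical stand-ins for the real cache file.

```python
import io
import pickle

url_dict = {}  # module-level cache, named as in urlslurper.py


def load_without_global(f):
    # This assignment creates a new local variable called url_dict;
    # the module-level dictionary is left untouched.
    url_dict = pickle.load(f)
    return url_dict


def load_with_global(f):
    # Declaring the name global makes the assignment rebind the
    # module-level cache, which is what the commit restores.
    global url_dict
    url_dict = pickle.load(f)


# Stand-in for the pickled cache file on disk.
saved = io.BytesIO(pickle.dumps({"http://example.invalid/": 0.5}))

load_without_global(saved)
after_local = dict(url_dict)    # snapshot of the module cache

saved.seek(0)
load_with_global(saved)
after_global = dict(url_dict)   # snapshot after the global assignment
```

In this sketch after_local is still empty, while after_global holds the
loaded entry.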