[Spambayes-checkins] spambayes/testtools urlslurper.py,1.2,1.3
Mark Hammond
mhammond at users.sourceforge.net
Thu May 1 17:09:23 EDT 2003
Update of /cvsroot/spambayes/spambayes/testtools
In directory sc8-pr-cvs1:/tmp/cvs-serv2431
Modified Files:
urlslurper.py
Log Message:
I may have broken a global in my last checkin - sorry about that. Cache works again.
Only slurp HTML content - avoid GIFs etc.
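
As a rough illustration of the "only slurp HTML" rule in the log message, here is a minimal sketch in Python 3 syntax (the checkin itself is Python 2). The helper names is_html and require_html are hypothetical and do not appear in urlslurper.py. The sketch also assumes the Content-Type header may be missing, in which case headers.get() returns None; it treats that case as "not HTML". The lower() call is likewise an addition here, on the grounds that media types are case-insensitive.

```python
class IgnoreURLException(Exception):
    """Raised when a URL should be skipped rather than slurped."""


def is_html(content_type):
    # headers.get('content-type') returns None when the header is absent,
    # so guard against that before calling a string method on it.
    if content_type is None:
        return False
    # Accept "text/html" with or without parameters such as "; charset=...".
    return content_type.lower().startswith("text/html")


def require_html(content_type):
    # Mirrors the checkin's behaviour: anything that is not text/html
    # raises, so the caller can treat the URL as unfetchable.
    if not is_html(content_type):
        raise IgnoreURLException("content type=%r" % (content_type,))
```

A caller would pass in the header value from the opened URL and catch IgnoreURLException alongside IOError and socket.error, as the revised except clause in the diff does.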
Index: urlslurper.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/testtools/urlslurper.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** urlslurper.py 1 May 2003 11:46:51 -0000 1.2
--- urlslurper.py 1 May 2003 23:09:20 -0000 1.3
***************
*** 116,119 ****
--- 116,123 ----
  url_dict = {}
+ # Exception to raise when we should skip the URL
+ class IgnoreURLException(Exception):
+     pass
+
  def body_tokens(msg):
      tokens = Tokenizer().tokenize_body(msg)
***************
*** 164,167 ****
--- 168,175 ----
  try:
      f = urllib2.urlopen(url)
+     # Anything that isn't text/html is ignored
+     content_type = f.headers.get('content-type')
+     if not content_type.startswith("text/html"):
+         raise IgnoreURLException("content type='%s'" % (content_type,))
      page = f.read()
      f.close()
***************
*** 170,176 ****
      if options["globals", "verbose"]:
          print >> sys.stderr, "Slurped."
! except (IOError, socket.error):
      url_dict[url] = 0.5
!     print >> sys.stderr, "Couldn't get", url
  if not url_dict.has_key(url) or url_dict[url] != 0.5:
      # Create a fake Message object since Tokenizer is
--- 178,184 ----
      if options["globals", "verbose"]:
          print >> sys.stderr, "Slurped."
! except (IgnoreURLException, IOError, socket.error), details:
      url_dict[url] = 0.5
!     print >> sys.stderr, "Couldn't get %s (%s)" % (url, details)
  if not url_dict.has_key(url) or url_dict[url] != 0.5:
      # Create a fake Message object since Tokenizer is
***************
*** 222,225 ****
--- 230,234 ----
  if os.path.exists(filename):
      f = file(filename, "r")
+     global url_dict
      url_dict = pickle.load(f)
      f.close()
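
The last hunk is presumably the "broken global" from the log message: if the pickle is loaded inside a function, assigning to url_dict without a global declaration rebinds a local name and leaves the module-level cache empty. A small sketch of that behaviour, in Python 3 syntax and with hypothetical function names and sample data (none of these come from urlslurper.py):

```python
import io
import pickle

url_dict = {}  # module-level cache, standing in for the one in urlslurper.py


def load_without_global(f):
    # No `global` declaration: this assignment creates a local variable,
    # so the module-level url_dict is not updated.
    url_dict = pickle.load(f)
    return url_dict


def load_with_global(f):
    # With the declaration, the assignment rebinds the module-level name.
    global url_dict
    url_dict = pickle.load(f)


# Hypothetical cache contents, pickled in memory instead of read from a file.
saved = pickle.dumps({"http://example.com/": 0.5})

load_without_global(io.BytesIO(saved))
cache_after_local = dict(url_dict)   # snapshot of the module-level cache

load_with_global(io.BytesIO(saved))
cache_after_global = dict(url_dict)  # snapshot after the global rebinding
```

Only the second call changes what the rest of the module sees, which matches the log message's note that the cache works again once the declaration is added.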