[Spambayes-checkins] spambayes/spambayes tokenizer.py,1.43,1.44

Skip Montanaro montanaro at users.sourceforge.net
Mon Aug 7 04:47:13 CEST 2006


Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv10981

Modified Files:
	tokenizer.py 
Log Message:
While splicing several changes back in one by one, I completely left out the
code that handles x-lookup_ip.  That would explain why my testing today
didn't show any improvement!
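
For readers skimming the diff below, here is a minimal standalone sketch of
the clues the restored x-lookup_ip code pushes for each resolved address.
It is an illustration only, not the tokenizer code itself: the function
name ip_clues is hypothetical, the DNS cache lookup is replaced by a plain
list of dotted-quad strings, and it is written for Python 3 rather than
the Python 2 of the original.

```python
def ip_clues(ips):
    """Return url-ip clues for a list of dotted-quad IPv4 strings.

    Hypothetical helper mirroring the logic in the diff: an empty
    result list is reported as a timeout; otherwise each address
    yields one clue per prefix length (/32, /8, /16, /24).
    """
    if len(ips) == 0:
        return ["url-ip:timeout"]
    clues = []
    for ip in ips:
        quads = ip.split(".")
        clues.append("url-ip:%s/32" % ip)
        clues.append("url-ip:%s/8" % quads[0])
        clues.append("url-ip:%s.%s/16" % (quads[0], quads[1]))
        clues.append("url-ip:%s.%s.%s/24" % (quads[0], quads[1], quads[2]))
    return clues


# 192.0.2.17 is an address from the documentation range, used as sample input.
for clue in ip_clues(["192.0.2.17"]):
    print(clue)
```

The coarser /8, /16 and /24 clues let the classifier pick up on a
network neighbourhood even when the exact host address has not been seen
before.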

Also, tweak image-size to yield only a single token, computed from the
combined payload length of all images, and only if there is at least one
decodable image.
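
The image-size change can be sketched in isolation roughly as follows.
This is a hedged approximation, not the tokenizer method: the function
name image_size_tokens is hypothetical, the MIME parts are replaced by a
list of already-extracted payloads (bytes or None), and math.log2 from
Python 3 stands in for the tokenizer's own log2 helper.

```python
import math


def image_size_tokens(payloads):
    """Yield at most one image-size token for a list of image payloads.

    Hypothetical stand-in for the loop in the diff: payload lengths are
    summed, a control token is emitted for each missing payload, and the
    size token is emitted once, only if the total length is non-zero.
    """
    total_len = 0
    for text in payloads:
        # None and empty payloads both contribute zero bytes.
        total_len += len(text or "")
        if text is None:
            yield "control: image payload is None"
    if total_len:
        yield "image-size:2**%d" % round(math.log2(total_len))


# Two images totalling 4000 bytes, plus one part with no payload.
print(list(image_size_tokens([b"x" * 1000, None, b"y" * 3000])))
```

Bucketing the combined size to the nearest power of two keeps the number
of distinct image-size tokens small, whatever the number of attached
images.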


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.43
retrieving revision 1.44
diff -C2 -d -r1.43 -r1.44
*** tokenizer.py	6 Aug 2006 20:55:10 -0000	1.43
--- tokenizer.py	7 Aug 2006 02:47:10 -0000	1.44
***************
*** 1085,1088 ****
--- 1085,1103 ----
              scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
  
+             if cache is not None and options["Tokenizer", "x-lookup_ip"]:
+                 ips=cache.lookup(netloc)
+                 if len(ips)==0:
+                     pushclue("url-ip:timeout")
+                 else:
+                     for ip in ips: # Should we limit to one A record?
+                         pushclue("url-ip:%s/32" % ip)
+                         dottedQuadList=ip.split(".")
+                         pushclue("url-ip:%s/8" % dottedQuadList[0])
+                         pushclue("url-ip:%s.%s/16" % (dottedQuadList[0],
+                                                       dottedQuadList[1]))
+                         pushclue("url-ip:%s.%s.%s/24" % (dottedQuadList[0],
+                                                          dottedQuadList[1],
+                                                          dottedQuadList[2]))
+ 
              # one common technique in bogus "please (re-)authorize yourself"
              # scams is to make it appear as if you're visiting a valid
***************
*** 1605,1608 ****
--- 1620,1624 ----
              # each image.
              
+             total_len = 0
              for part in parts:
                  try:
***************
*** 1612,1621 ****
                      text = part.get_payload(decode=False)
  
                  if text is None:
                      yield "control: image payload is None"
-                     continue
  
!                 if text:
!                     yield "image-size:2**%d" % round(log2(len(text)))
  
          if options["Tokenizer", "x-crack_images"]:
--- 1628,1637 ----
                      text = part.get_payload(decode=False)
  
+                 total_len += len(text or "")
                  if text is None:
                      yield "control: image payload is None"
  
!             if total_len:
!                 yield "image-size:2**%d" % round(log2(total_len))
  
          if options["Tokenizer", "x-crack_images"]:


