[Spambayes-checkins] spambayes tokenizer.py,1.42,1.43

Neil Schemenauer nascheme@users.sourceforge.net
Sat, 28 Sep 2002 21:14:39 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11632

Modified Files:
	tokenizer.py 
Log Message:
Mine the To and Cc headers.  This is another definite win for me.  I'm not sure
about the log2 trick, but it seems to work okay.
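
For readers skimming the diff, here is a rough standalone sketch of what the
log2 trick does: it buckets the recipient count by its nearest power of two, so
messages with 6 or 11 recipients produce the same token. The function name
recipient_count_tokens and the sample message are hypothetical, used only for
illustration; the real change lives inside the tokenizer's header handling, and
this sketch is written for a current Python rather than the one the commit
targeted.

```python
import math
from email import message_from_string


def log2(n):
    # Base-2 logarithm, computed the same way as the helper in the commit.
    return math.log(n) / math.log(2)


def recipient_count_tokens(msg, fields=("to", "cc")):
    """Yield one bucketed token per recipient header that lists addresses.

    Hypothetical helper name; it mirrors the loop added in this revision.
    """
    for field in fields:
        count = 0
        for addrs in msg.get_all(field, []):
            # Naive count, as in the commit: split the header on commas.
            count += len(addrs.split(","))
        if count > 0:
            # Round log2(count) so nearby counts share a single token.
            yield "%s:2**%d" % (field, round(log2(count)))


# Made-up sample message: three To addresses, one Cc address.
raw = (
    "To: a@example.com, b@example.com, c@example.com\n"
    "Cc: d@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
tokens = list(recipient_count_tokens(message_from_string(raw)))
print(tokens)  # ['to:2**2', 'cc:2**0']
```

With this rounding, 1 recipient maps to 2**0, 2 to 2**1, 3 through 5 to 2**2,
and 6 through 11 to 2**3, which keeps the number of distinct tokens small even
for very long recipient lists.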


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.42
retrieving revision 1.43
diff -C2 -d -r1.42 -r1.43
*** tokenizer.py	28 Sep 2002 18:48:52 -0000	1.42
--- tokenizer.py	29 Sep 2002 04:14:36 -0000	1.43
***************
*** 8,11 ****
--- 8,12 ----
  import email.Errors
  import re
+ import math
  from sets import Set
  
***************
*** 771,774 ****
--- 772,778 ----
          yield '.'.join(parts[:i])
  
+ def log2(n, log=math.log, c=math.log(2)):
+     return log(n)/c
+ 
  uuencode_begin_re = re.compile(r"""
      ^begin \s+
***************
*** 963,966 ****
--- 967,980 ----
                  for t in tokenize_word(w):
                      yield prefix + t
+ 
+         # To:
+         # Cc: 
+         # Count the number of addresses in each of the recipient headers.
+         for field in ('to', 'cc'):
+             count = 0
+             for addrs in msg.get_all(field, []):
+                 count += len(addrs.split(','))
+             if count > 0:
+                 yield '%s:2**%d' % (field, round(log2(count)))
  
          # These headers seem to work best if they're not tokenized:  just