[Spambayes-checkins] spambayes tokenizer.py,1.42,1.43
Neil Schemenauer
nascheme@users.sourceforge.net
Sat, 28 Sep 2002 21:14:39 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv11632
Modified Files:
tokenizer.py
Log Message:
Mine the To and Cc headers. This is another definite win for me. I'm not sure
about the log2 trick, but it seems to work okay.
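
The "log2 trick" in this change buckets the recipient count by powers of two, so that 3 and 4 addresses give the same token while 1 and 16 do not. What follows is a minimal sketch of that idea in present-day Python 3, not the committed code: the function name recipient_tokens and the example addresses are hypothetical, and math.log2 stands in for the log2 helper added by the diff. Like the commit, it counts addresses by splitting on commas, which is only a rough count (a quoted display name containing a comma would be counted twice).

```python
import math
from email.message import Message


def recipient_tokens(msg):
    """Yield one token per recipient header, bucketed by powers of two.

    Hypothetical helper for illustration; mirrors the loop in the diff.
    """
    for field in ('to', 'cc'):
        count = 0
        for value in msg.get_all(field, []):
            # Rough count, as in the commit: one address per comma-separated part.
            count += len(value.split(','))
        if count > 0:
            # Round log2(count) to the nearest integer to pick the bucket.
            yield '%s:2**%d' % (field, round(math.log2(count)))


msg = Message()
msg['To'] = 'a@example.com, b@example.com, c@example.com'
msg['Cc'] = 'd@example.com'
print(list(recipient_tokens(msg)))  # ['to:2**2', 'cc:2**0']
```

Three To addresses give log2(3) of about 1.58, which rounds to 2, so the token is to:2**2. A single Cc address gives log2(1) = 0, so the token is cc:2**0. A header that is absent produces no token at all.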
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.42
retrieving revision 1.43
diff -C2 -d -r1.42 -r1.43
*** tokenizer.py 28 Sep 2002 18:48:52 -0000 1.42
--- tokenizer.py 29 Sep 2002 04:14:36 -0000 1.43
***************
*** 8,11 ****
--- 8,12 ----
import email.Errors
import re
+ import math
from sets import Set
***************
*** 771,774 ****
--- 772,778 ----
yield '.'.join(parts[:i])
+ def log2(n, log=math.log, c=math.log(2)):
+     return log(n)/c
+
uuencode_begin_re = re.compile(r"""
^begin \s+
***************
*** 963,966 ****
--- 967,980 ----
for t in tokenize_word(w):
yield prefix + t
+
+ # To:
+ # Cc:
+ # Count the number of addresses in each of the recipient headers.
+ for field in ('to', 'cc'):
+     count = 0
+     for addrs in msg.get_all(field, []):
+         count += len(addrs.split(','))
+     if count > 0:
+         yield '%s:2**%d' % (field, round(log2(count)))
# These headers seem to work best if they're not tokenized: just