[Spambayes-checkins] spambayes tokenizer.py,1.40,1.41
Neil Schemenauer
nascheme@users.sourceforge.net
Thu, 26 Sep 2002 21:06:15 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4807
Modified Files:
tokenizer.py
Log Message:
Add basic Message-ID tokenization. Right now it just checks that the
header exists and conforms to the usual syntax. If it does, the host
part is returned as a token. I tried doing more, but the extra tokens
never turned out to be good discriminators. Stupid wins again. :-)
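
For readers who want to try the check outside the tokenizer, here is a
minimal standalone sketch of the behaviour described in the log message.
The regular expression is copied from the diff below. The function name
message_id_tokens is hypothetical: in the commit the logic sits inline in
the header tokenizer and is gated by options.mine_message_ids, which is
omitted here.

```python
import re

# Same pattern as the one added in this revision.
message_id_re = re.compile(r'\s*<[^@]+@([^>]+)>\s*')


def message_id_tokens(msgid):
    """Yield one token for a raw Message-ID header value.

    Hypothetical helper for illustration; pass "" when the header is
    missing, mirroring msg.get("message-id", "") in the diff.
    """
    m = message_id_re.match(msgid)
    if not m:
        # Missing or oddly formatted header.
        yield 'message-id:invalid'
    else:
        # Usual <local@host> syntax: keep only the host part.
        yield 'message-id:@%s' % m.group(1)


if __name__ == '__main__':
    for value in ("<abc123@mail.example.com>", "", "abc123@mail.example.com"):
        print(repr(value), list(message_id_tokens(value)))
```

Because the pattern is applied with match(), it is anchored only at the
start of the value, so a header with extra text after the closing ">"
still counts as valid.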
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** tokenizer.py 27 Sep 2002 01:28:43 -0000 1.40
--- tokenizer.py 27 Sep 2002 04:06:12 -0000 1.41
***************
*** 597,600 ****
--- 597,602 ----
received_ip_re = re.compile(r'\s[[(]((\d{1,3}\.?){4})[\])]')
+ message_id_re = re.compile(r'\s*<[^@]+@([^>]+)>\s*')
+
# I'm usually just splitting on whitespace, but for subject lines I want to
# break things like "Python/Perl comparison?" up. OTOH, I don't want to
***************
*** 981,984 ****
--- 983,996 ----
for tok in breakdown(m.group(1).lower()):
yield 'received:' + tok
+
+ if options.mine_message_ids:
+     msgid = msg.get("message-id", "")
+     m = message_id_re.match(msgid)
+     if not m:
+         # might be weird instead of invalid but who cares?
+         yield 'message-id:invalid'
+     else:
+         # looks okay, return the hostname only
+         yield 'message-id:@%s' % m.group(1)
# As suggested by Anthony Baxter, merely counting the number of