[Spambayes-checkins] spambayes tokenizer.py,1.40,1.41

Neil Schemenauer nascheme@users.sourceforge.net
Thu, 26 Sep 2002 21:06:15 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv4807

Modified Files:
	tokenizer.py 
Log Message:
Add basic message-id tokenization.  Right now it just checks that the
Message-ID header exists and conforms to the usual <local@host> syntax.
If it does, a token for the host part is generated; otherwise a single
'message-id:invalid' token is generated.  I tried extracting more, but
the extra tokens never turned out to be good discriminators.  Stupid
wins again. :-)
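
For illustration, the check described above can be sketched as standalone
Python.  This is a minimal sketch, not the actual tokenizer.py code path:
the function name message_id_tokens and the sample messages are
hypothetical, the options.mine_message_ids switch is left out, and the
header is parsed with the standard email package.  The regular expression
is the one added in the diff below.

```python
import re
from email import message_from_string

# Same pattern as in the diff: '<', a non-empty local part, '@',
# then the host part captured up to the closing '>'.
message_id_re = re.compile(r'\s*<[^@]+@([^>]+)>\s*')


def message_id_tokens(msg):
    """Yield one token describing the Message-ID header of msg.

    Hypothetical helper for illustration; msg is an email.message.Message.
    """
    # Header lookup is case-insensitive; a missing header becomes "".
    msgid = msg.get("message-id", "")
    m = message_id_re.match(msgid)
    if not m:
        # Missing or oddly formed header: emit one generic token.
        yield 'message-id:invalid'
    else:
        # Looks okay: emit only the host part.
        yield 'message-id:@%s' % m.group(1)


good = message_from_string(
    "Message-ID: <abc123@mail.example.com>\nSubject: hi\n\nbody\n")
missing = message_from_string("Subject: hi\n\nbody\n")
odd = message_from_string("Message-ID: no-angle-brackets\n\nbody\n")

good_tokens = list(message_id_tokens(good))
missing_tokens = list(message_id_tokens(missing))
odd_tokens = list(message_id_tokens(odd))
```

Note that the pattern is only anchored at the start of the header (via
match), so text after the closing '>' does not make the header count as
invalid.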


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** tokenizer.py	27 Sep 2002 01:28:43 -0000	1.40
--- tokenizer.py	27 Sep 2002 04:06:12 -0000	1.41
***************
*** 597,600 ****
--- 597,602 ----
  received_ip_re = re.compile(r'\s[[(]((\d{1,3}\.?){4})[\])]')
  
+ message_id_re = re.compile(r'\s*<[^@]+@([^>]+)>\s*')
+ 
  # I'm usually just splitting on whitespace, but for subject lines I want to
  # break things like "Python/Perl comparison?" up.  OTOH, I don't want to
***************
*** 981,984 ****
--- 983,996 ----
                          for tok in breakdown(m.group(1).lower()):
                              yield 'received:' + tok
+ 
+         if options.mine_message_ids:
+             msgid = msg.get("message-id", "")
+             m = message_id_re.match(msgid)
+             if not m:
+                 # might be weird instead of invalid but who cares?
+                 yield 'message-id:invalid'
+             else:
+                 # looks okay, return the hostname only
+                 yield 'message-id:@%s' % m.group(1)
  
          # As suggested by Anthony Baxter, merely counting the number of