[Spambayes] Train reliably on "forwarded" messages?

Wed Feb 4 19:44:18 EST 2004

> I run spambayes on my mail server using procmail and it works
> brilliantly. Thank you thank you thank you. 

Since you're using procmail, the solution that Rhesa suggested is probably
the best, but FWIW:

> What I'd like to do, for training, is forward "spam to be
> trained" to a special address (e.g. train-spam at myserver.com) 
> and similarly for ham - train-ham at mailserver.com. I would 
> then run the mailboxtrain.py on the server on the inboxes for 
> those two dummy accounts. 

The basic functionality you're after here (training by forwarding mail) is
provided by the SMTP proxy that SpamBayes includes.  You access this via
sb_server.  I'm not sure whether you're already using sb_server or
sb_filter, but you can use sb_server without a POP3 proxy, just as you can
use it without an SMTP proxy.

Basically, you send all your outgoing mail through the SMTP proxy (this
assumes you're using SMTP for outgoing mail, of course).  It intercepts (and
does not send) any mail addressed to two special addresses and trains the
database based on those.
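
As a rough illustration of that routing decision (this is a sketch, not the
actual SpamBayes code), the function below sorts an outgoing message by its
recipients.  The two addresses and the name route() are made up for the
example; the real addresses are whatever you configure.

```python
# Hypothetical training addresses; substitute whatever you configure.
SPAM_ADDRESS = "train-spam@example.com"
HAM_ADDRESS = "train-ham@example.com"


def route(recipients):
    """Return 'spam', 'ham', or 'deliver' for a list of recipient addresses."""
    # Compare case-insensitively and ignore stray whitespace.
    normalized = {addr.strip().lower() for addr in recipients}
    if SPAM_ADDRESS in normalized:
        return "spam"      # train as spam, do not send on
    if HAM_ADDRESS in normalized:
        return "ham"       # train as ham, do not send on
    return "deliver"       # ordinary mail: pass through to the real server
```

Anything that comes back as 'deliver' would simply be handed on to your
normal outgoing mail server untouched.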

> However, I realize that the mail messages in these two
> inboxes will look a little different than when they showed up 
> in my inbox ("Forwarded" headers, addressed to "train-spam" 
> rather than "berendes", >>> down the side).

To avoid this, the SMTP proxy has the ability to 'look up' the original
message and use that for training instead of the mangled message.  To do
this, the original message must be on an IMAP server (this hasn't really been
used or tested much) or in the sb_server/pop3proxy cache directories.  In
addition, the mail client must forward all the headers of the original
message (Eudora does, I believe).  If you're using sb_server, then this will
work fine.  If you're using sb_filter, this probably won't, although you
could easily enough patch it to find the message elsewhere (for example if
you saved a raw copy as in Rhesa's solution).  The SMTP proxy can be set to
just train on the raw mail sent to it, however.
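
To give a feel for the 'look up' idea, here is a minimal sketch that assumes
the forwarded text still contains the original Message-ID header and that
you keep raw copies in a dict keyed by that ID.  Both assumptions, and the
names find_original() and text_to_train_on(), are mine for illustration;
the SMTP proxy's own lookup may well work differently.

```python
import re

# Matches a Message-ID header line, with or without ">" quote markers in
# front of it (forwarding clients often quote the original headers).
ID_PATTERN = re.compile(
    r"^[>\s]*Message-ID:\s*(<[^>\s]+>)", re.IGNORECASE | re.MULTILINE
)


def find_original(forwarded_text, cache):
    """Return the cached raw message the forward refers to, or None."""
    for match in ID_PATTERN.finditer(forwarded_text):
        original = cache.get(match.group(1))
        if original is not None:
            return original
    return None


def text_to_train_on(forwarded_text, cache):
    """Prefer the untouched original; fall back to the forwarded copy."""
    original = find_original(forwarded_text, cache)
    return original if original is not None else forwarded_text
```

The fallback in text_to_train_on() corresponds to training on the raw mail
as sent when no original can be found.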

> Will this "forward junk" throw off the training process?

Yes.  How much and whether it will have a significant effect is uncertain
and depends a lot on your mail stream itself.  You would probably want to
fiddle with the tokenizing settings so that fewer tokens are generated from
the headers (at least in training).
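
If you do end up training on the forwarded copies, you could also clean off
the most obvious forwarding noise yourself before training.  The sketch
below only strips leading ">" quote markers; it is not part of SpamBayes,
the name strip_quote_markers() is hypothetical, and it does nothing about
the changed headers.

```python
import re

# One or more ">" markers at the start of a line, each optionally
# followed by a single whitespace character.
QUOTE_PREFIX = re.compile(r"^(?:>\s?)+")


def strip_quote_markers(text):
    """Remove leading '>' quote markers from every line of text."""
    return "\n".join(QUOTE_PREFIX.sub("", line) for line in text.splitlines())
```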

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.