[Spambayes] A proposal for mail filtering

Tony Meyer tameyer at ihug.co.nz
Wed Dec 3 17:44:48 EST 2003


> The 
> problem is, that this is still training on all of my spam, and so 
> eventually my SPAM count will end up being too high as well.

One option, of course, is to simply not train on as much.  A lot of people
are reporting good results with less training data.

> 1. Registered as HAM
> 2. Registered as SPAM
> 3. Registered as UNSURE
> 4. Trained as HAM
> 5. Trained as SPAM

FWIW, this information is already stored.  In sb_server and sb_imapfilter
it's in the messageinfo db; with sb_filter it's a combination of the
X-SpamBayes-Trained and X-SpamBayes-Classification headers; and IIRC the
plug-in records this as well, in its equivalent of the messageinfo db.

> In mistakes mode we still "train" on all messages, but we do not add 
> the scores to either of ham or spam unless the message is being 
> re-classified. When we detect that a message has been incorrectly 
> classified then we increase the appropriate ham/spam score. To my way 
> of thinking this means that we would then need to have five states 
> associated with each message id.

This wouldn't be hard to test in sb_imapfilter.  There's a function (called
Train(), I think) that trains all messages in a folder.  For each message,
it checks whether it has already been trained, and if so, untrains it first.
It then trains the message with the new classification.  You could simply
make this last step conditional on the first, so that a message is only
trained when it had to be untrained, i.e. when it is being re-classified.
(Not that I've tried this, but it sounds good <wink>.)
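
To make that concrete, here is a rough, untested-against-SpamBayes sketch of
the idea.  It does not use the real sb_imapfilter code: ToyClassifier,
train_folder(), the trained_as dict and the mistakes_only flag are all
made-up names for illustration, and the "classifier" only counts tokens.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a classifier: it only counts tokens."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.ntrained = {True: 0, False: 0}

    def learn(self, tokens, is_spam):
        self.counts[is_spam].update(tokens)
        self.ntrained[is_spam] += 1

    def unlearn(self, tokens, is_spam):
        self.counts[is_spam].subtract(tokens)
        self.ntrained[is_spam] -= 1


def train_folder(classifier, messages, trained_as, is_spam, mistakes_only=False):
    """Train every message in a folder as ham or spam.

    messages:   dict of message id -> list of tokens (assumed layout)
    trained_as: dict of message id -> bool, how each message was trained before
    Returns the number of messages that were (re)trained.
    """
    num_trained = 0
    for msg_id, tokens in messages.items():
        previous = trained_as.get(msg_id)
        if previous == is_spam:
            continue  # already trained with this classification
        was_wrong = previous is not None
        if was_wrong:
            # First step: undo the earlier, incorrect training.
            classifier.unlearn(tokens, previous)
        if mistakes_only and not was_wrong:
            continue  # last step is conditional on the first
        classifier.learn(tokens, is_spam)
        trained_as[msg_id] = is_spam
        num_trained += 1
    return num_trained
```

With mistakes_only=False this behaves like the loop described above; with
mistakes_only=True a message that was never trained before is left alone.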

> maybe we could also automatically train on the last "x" 
> HAM/SPAM (whichever needs to be "balanced") if the ratio of 
> one to the other gets more than 1.5.

This wouldn't be that much harder to add, either.  Similar things have been
proposed in the past, and, IIRC, the main concern was that this would make
it much harder to understand what the filter is doing, since it would be
deciding what to train 'on its own'.
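
For what it's worth, the balancing check itself could look something like
the sketch below.  This is only an illustration: pick_balancing_batch() and
its parameters are hypothetical names, the 1.5 threshold and the "last x"
come from the quoted proposal, and the lists of recent, not-yet-trained
messages are assumed to be kept by the caller.

```python
def pick_balancing_batch(nham, nspam, recent_ham, recent_spam,
                         max_ratio=1.5, x=10):
    """Decide whether to train extra messages to rebalance ham and spam.

    nham, nspam:             how many ham/spam have been trained so far
    recent_ham, recent_spam: recent messages not yet trained, oldest first
    Returns (is_spam, batch) for the class that is under-represented,
    or None if the counts are within max_ratio of each other.
    """
    if nspam > max_ratio * nham:
        batch = recent_ham[-x:]  # too much spam: train the last x ham
        return (False, batch) if batch else None
    if nham > max_ratio * nspam:
        batch = recent_spam[-x:]  # too much ham: train the last x spam
        return (True, batch) if batch else None
    return None


# Example: 100 ham and 200 spam trained, so the last two ham are picked.
decision = pick_balancing_batch(100, 200, ["h1", "h2", "h3"], ["s1"], x=2)
```

Logging each decision this returns would be one way to keep it visible what
the filter has chosen to train.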

Anyway, if you feel like coding, hopefully this gives you some starters :)

=Tony Meyer



