[Spambayes] Leaving for another tool. [BUG + FIX]

David Abrahams dave at boost-consulting.com
Fri Dec 14 22:43:08 CET 2007


on Tue Dec 11 2007, Thomas Hruska <thruska-AT-cubiclesoft.com> wrote:

> To fix this, Spambayes needs to reclassify all messages selected by the 
> user and pick and choose which ones it actually needs to train on. 
> Here's the ruleset that should be used (PHP-like pseudocode based on 
> observations of the behavior of the POP3 proxy - sorry, I don't know 
> Python):
>
> $userclassification = $_POST["userclassify_" . $id];
> if ($userclassification == "Ham" || $userclassification == "Spam")
> {
>    $reclassified = ClassifyTheOriginalMessage($id);
>    if ($reclassified != $userclassification)
>    {
>      TrainDatabaseWithMessage($id, $userclassification);
>    }
> }
>
> That simple change (probably a 5 minute fix for the developers) will 
> keep the database really small no matter how users behave.

I use the train-to-exhaustion script contrib/tte.py (on a cron job
every night) which does something very similar to that.  Works very
well.
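
For readers who have not seen the idea, here is a minimal Python sketch
of train-on-error repeated until a full pass produces no mistakes.  It
is not the code in contrib/tte.py and does not use the SpamBayes API:
ToyClassifier, its train() and score() methods, train_to_exhaustion(),
and the max_rounds and cutoff parameters are all hypothetical names
made up for illustration, and the scoring is a crude token-count ratio
rather than the real SpamBayes statistics.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (NOT the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages trained
        self.ham = Counter()   # token -> number of ham messages trained

    def train(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        """Crude spam score in [0, 1]; 0.5 means no evidence either way."""
        unique = set(tokens)
        s = sum(self.spam[t] for t in unique)
        h = sum(self.ham[t] for t in unique)
        return 0.5 if s + h == 0 else s / (s + h)


def train_to_exhaustion(clf, corpus, max_rounds=10, cutoff=0.5):
    """Train only on messages the classifier currently gets wrong.

    corpus is a list of (tokens, is_spam) pairs.  Passes over the corpus
    are repeated until one pass has no mistakes or max_rounds is reached.
    Returns a Counter mapping message index -> times it was trained.
    """
    trained = Counter()
    for _ in range(max_rounds):
        misses = 0
        for i, (tokens, is_spam) in enumerate(corpus):
            predicted_spam = clf.score(tokens) > cutoff
            if predicted_spam != is_spam:
                # Same rule as the quoted pseudocode: train only on a mismatch.
                clf.train(tokens, is_spam)
                trained[i] += 1
                misses += 1
        if misses == 0:
            break
    return trained
```

Because only the mistakes are trained, the database grows with the
number of errors rather than with the size of the corpus.  The returned
counts are a convenience of this sketch: a message that has to be
trained over and over may deserve a second look at its label, though
that heuristic is an assumption here and not something taken from
tte.py.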

One reason you might be seeing bad results is that one or two messages
in your training data are filed under the wrong class (in a corpus that
big, it's hard to avoid).  A single mislabeled message can make things
go bad quickly.  In fact, if you
use tte.py, in addition to getting better results, you may be able to
identify misclassifications.  See
http://article.gmane.org/gmane.mail.spam.spambayes.devel/3902

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com


