[Spambayes] training suggestions

skip at pobox.com skip at pobox.com
Fri Aug 4 01:54:17 CEST 2006


    Dhaval> I just trained manually and realize that it says trained 190 out
    Dhaval> of 190 messages even though there are only ~20 new messages
    Dhaval> since the last training. Is the output wrong?

It's been a while since I last used sb_mboxtrain.  I don't recall there ever
being a problem with its counting, but I don't use Maildir, so maybe there's
some issue there.

    Dhaval> My problem with running it at all is misclassification. How do I
    Dhaval> get it to fix the ones that are misclassified?

Which ones are misclassified?  Do you want it to reclassify the mail in your
users' inboxes or do *they* make classification mistakes (or a bit of both)?

    Dhaval> According to your last sentence, when I leave out the -f, I
    Dhaval> would also not train the message that was properly classified
    Dhaval> right? Or maybe not. How would/should incremental training be
    Dhaval> handled if the previous training occurred with
    Dhaval> misclassifications?

I don't know how your users build up their ham/spam databases.  Let's assume
for the moment that they simply save misclassified mail to one of two
special mailboxes, ham and spam, and that incoming mail classified as spam
lands in a mailbox called inspam.  They may well have many other mailboxes,
though.  Your process might look like this:

    1. Check user's ham and spam mailboxes.  If they haven't been updated
       since the last run, quit.

    2. Retrain (incremental or total, as suits your environment).

    3. Check the user's other mailboxes.  If any have been updated since the
       training run the day before, reclassify them.  Any message whose
       classification changes from unsure or ham to spam is migrated to the
       inspam mailbox.
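
Steps 1 and 3 might be sketched roughly like this in Python, using only the
standard mailbox module.  This is a guess at the shape, not tested against a
real setup: the function names, the classify callable and its
"ham"/"unsure"/"spam" return values are placeholders rather than SpamBayes
APIs, and the retraining in step 2 is left to whatever tool you already use.

```python
import mailbox
import os


def needs_retrain(training_paths, last_run):
    """Step 1: report whether any training mailbox changed since last_run.

    training_paths: paths of the user's ham and spam mailboxes.
    last_run: time of the previous run, in seconds since the epoch.
    """
    return any(os.path.getmtime(path) > last_run for path in training_paths)


def migrate_new_spam(source, inspam, classify):
    """Step 3: move messages that now classify as spam into inspam.

    source, inspam: mailbox.Mailbox objects (mbox, Maildir, ...).
    classify: hypothetical callable taking a message and returning
        "ham", "unsure" or "spam".
    Returns the number of messages moved.

    No locking is done here; a real deployment would need to guard
    against mail being delivered while this runs.
    """
    moved = 0
    # Copy the keys first, since messages are removed while looping.
    for key in list(source.keys()):
        message = source[key]
        if classify(message) == "spam":
            inspam.add(message)
            source.remove(key)
            moved += 1
    # Write the pending additions and removals back to disk.
    inspam.flush()
    source.flush()
    return moved
```

You would call needs_retrain() on the ham and spam mailboxes first, retrain
if it returns true, then call migrate_new_spam() once for each of the other
mailboxes that changed.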

You're done.  Users can check their inspam mailbox whenever they want to try
to locate false positives.  You might also give them an inham mailbox.  If
you rescore inspam, any messages whose score changes from spam to ham get
moved to inham.

I've never done anything for a group like this and don't know enough about
your environment to do more than speculate about an eventual solution.

Skip


