[Spambayes] training suggestions
skip at pobox.com
Fri Aug 4 01:54:17 CEST 2006
Dhaval> I just trained manually and realize that it says trained 190 out
Dhaval> of 190 messages even though there are only ~20 new messages
Dhaval> since the last training. Is the output wrong?
It's been a while since I last used sb_mboxtrain, and I don't recall it ever
having a problem with its counting. I don't use Maildir, though, so there
may be some issue there.
Dhaval> My problem with running it at all is misclassification. How do I
Dhaval> get it to fix the ones that are misclassified?
Which ones are misclassified? Do you want it to reclassify the mail in your
users' inboxes or do *they* make classification mistakes (or a bit of both)?
Dhaval> According to your last sentence, when I leave out the -f, I
Dhaval> would also not train the message that was properly classified
Dhaval> right? Or maybe not. How would/should incremental training be
Dhaval> handled if the previous training occurred with
Dhaval> misclassifications?
I don't know how your users build up their ham/spam databases. Let's assume
for the moment that they simply save misclassified mail to one of two
special mailboxes, ham and spam, and that incoming mail that is classified as
spam lands in a mailbox called inspam. They may well have many other
mailboxes, though. Your process might look like this:
    1. Check the user's ham and spam mailboxes. If they haven't been
       updated since the last run, quit.

    2. Retrain (incremental or total, as suits your environment).

    3. Check the user's other mailboxes. If any have been updated since
       the training run the day before, reclassify them. Any messages
       whose classification changes from unsure or ham to spam are
       migrated to the inspam mailbox.
You're done. Users can check their inspam mailbox whenever they want to try
and locate false positives. You might also give them an inham mailbox. If
you rescore it, any messages which change score from spam to ham get moved
there.
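
As a rough illustration of the bookkeeping in those steps, here is a minimal
Python sketch. It is only a sketch under assumptions: the function names
(updated_since, plan_moves) are made up for this example and are not part of
SpamBayes, the labels are plain strings, and the "has it changed?" test uses
file modification times, which suits mbox files but would need adapting for
Maildir directories. The actual training and scoring are left out entirely.

```python
import os


def updated_since(paths, last_run):
    """Return the paths modified after last_run (seconds since the epoch).

    Hypothetical helper: compares each path's modification time with the
    time of the previous run.
    """
    return [p for p in paths if os.path.getmtime(p) > last_run]


def plan_moves(old_labels, new_labels):
    """Decide where to move messages whose classification changed.

    Both arguments map a message id to "ham", "unsure" or "spam".
    Returns a dict of message id -> destination mailbox name.
    """
    moves = {}
    for msg_id, new in new_labels.items():
        old = old_labels.get(msg_id)
        if new == "spam" and old in ("ham", "unsure"):
            # Newly recognized spam goes to inspam.
            moves[msg_id] = "inspam"
        elif new == "ham" and old == "spam":
            # Rescued false positives go to the optional inham mailbox.
            moves[msg_id] = "inham"
    return moves
```

A nightly job could call updated_since() on the ham and spam mailboxes to
decide whether to retrain at all, then call plan_moves() with the labels from
before and after rescoring to see which messages need to be moved.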
I've never done anything for a group like this and don't know enough about
your environment to do more than speculate about an eventual solution.
Skip