[Spambayes] incremental training strategies

Skip Montanaro skip@pobox.com
Mon Oct 28 18:42:08 2002


    Alex> Speaking from a theoretical purity standpoint, I suspect that
    Alex> training it on everything that came through would be
    Alex> 'cleaner'... but I have no idea if in practise it would work any
    Alex> better than just training on the mistakes and unsure.

Yeah, but theory and practice often disagree. ;-) The biggest problem I see
in training it on every message you encounter is that you are likely to make
mistakes, generally of the inattentive or fumble-fingered variety.
That's fine when you're testing the algorithm: you migrate the message to
the other pool, then test again.  It's a rather different proposition if you
are training on messages on the fly and then deleting them (or even if you
don't delete them).  How do you realize you misclassified a message?  And
if you do realize it, how do you undo the effect of the misclassification,
particularly if you no longer have the message lying around?
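
To make the "undo" problem concrete, here is a minimal sketch using a toy
per-token count store.  It is not the Spambayes classifier; the ToyCounts
class and its learn/unlearn/relabel methods are hypothetical names invented
for illustration.  The point it shows is that backing out a bad training
step needs the same tokens that went in, so the message (or at least its
token list) has to be kept.

```python
from collections import Counter


class ToyCounts:
    """Hypothetical token-count store, for illustration only."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        # Count each distinct token once per message.
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        # Assumes this exact message was learned earlier with this label.
        # Without the original tokens there is nothing to subtract.
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] -= 1
            if counts[tok] <= 0:
                del counts[tok]
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1

    def relabel(self, tokens, was_spam):
        # Fix a misclassification: remove the old label, add the new one.
        self.unlearn(tokens, was_spam)
        self.learn(tokens, not was_spam)


db = ToyCounts()
msg = "meeting notes attached".split()
db.learn(msg, is_spam=True)      # fumble-fingered: this was really ham
db.relabel(msg, was_spam=True)   # only possible because msg was kept
```

If the message had been deleted after the first learn() call, the spam
counts would stay polluted with no way to correct them.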

From the standpoint of minimizing human error, once you have a decent
hammie.db file, it seems to me that training only on unsure or
incorrectly classified messages is likely to be the best way to improve it.
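
A rough sketch of that selection rule, assuming the classifier returns a
score between 0.0 (ham) and 1.0 (spam).  The should_train function is a
hypothetical helper, and the two cutoff values are illustrative defaults
chosen for the example, not settings taken from any particular
configuration.

```python
def should_train(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return True if a message is worth training on.

    score       -- classifier output in [0.0, 1.0] (assumed convention)
    is_spam     -- the human's judgment of the message
    ham_cutoff  -- scores below this count as ham (illustrative value)
    spam_cutoff -- scores at or above this count as spam (illustrative)
    """
    # Unsure: the score falls between the two cutoffs.
    if ham_cutoff <= score < spam_cutoff:
        return True
    # Mistake: the confident prediction disagrees with the human.
    predicted_spam = score >= spam_cutoff
    return predicted_spam != is_spam
```

Messages the classifier already gets right with confidence are skipped, so
the human only has to make a judgment call on the small unsure-or-wrong
fraction, which leaves fewer chances for a slip.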

Skip