[spambayes-dev] lowest scoring message isn't always "best" one to train on

Seth Goodman nobody at spamcop.net
Mon Jan 19 12:20:31 EST 2004


[Skip Montanaro]
> Note that the first item has a very low spamprob itself, but of the
> bunch I displayed, the best ones to train on to push the most other
> spams into spam range all score around 0.8 to 0.9.  ...

I can add some anecdotal evidence to that.  My manual training regime for
coming up with a reduced training set involves iteratively training on the
lowest scoring spam until all untrained spam scores above 90%.  I've noticed
that I also get the most shifting of untrained spam classifications from
unsure to spam on the later messages I train on, that is, the ones with
higher scores.  My recollection is that things start to move much better
when the spam I add to the training set is around 75% or higher.  The
low-scoring unsures do move a few other low-scoring unsures up in score, but
I seem to get considerably more "action" out of the higher-scoring ones.
Since I stop at 90%, I have no experience as to what cutoff is optimal.
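
For anyone who wants to reproduce the manual procedure described above, it
amounts to roughly the loop below. This is only a sketch under stated
assumptions: `reduce_training_set`, `train`, and `score` are hypothetical
names, and the token-overlap score is a toy stand-in, not the SpamBayes
classifier or its API.

```python
def reduce_training_set(spams, train, score, cutoff=0.90):
    """Train on the lowest-scoring untrained spam until every remaining
    untrained spam scores above `cutoff`.

    Returns the indices of the messages trained on, in training order.
    """
    untrained = list(range(len(spams)))
    trained = []
    while untrained:
        lowest = min(untrained, key=lambda i: score(spams[i]))
        if score(spams[lowest]) > cutoff:
            break  # everything still untrained is already in spam range
        train(spams[lowest])
        untrained.remove(lowest)
        trained.append(lowest)
    return trained


# Toy stand-in classifier (an assumption for illustration only):
# a message's score is the fraction of its tokens already seen in spam.
known = {"free"}


def train(msg):
    known.update(msg)


def score(msg):
    toks = set(msg)
    return len(toks & known) / len(toks)


spams = [["lottery"], ["free", "click"], ["free", "lottery"]]
trained = reduce_training_set(spams, train, score)
```

Here `trained` lists which messages ended up in the reduced training set;
the third message is never trained on because the first two push it over
the cutoff.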

I like your concept of doing this explicitly.  With a small number of
unsures, as in a nightly training session, it would not take very long even
though it is O(n^2).
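
A rough sketch of what doing this explicitly might look like: for each
unsure spam, train on it alone, count how many of the other unsures cross
the spam cutoff, then untrain. That is n candidates times n rescorings,
hence O(n^2). `ToyScorer` and `rank_candidates` are hypothetical names, and
the token-overlap score is a toy stand-in rather than the SpamBayes
classifier or its API.

```python
from collections import Counter


class ToyScorer:
    """Toy stand-in classifier (an assumption, not SpamBayes):
    score = fraction of a message's tokens seen in trained spam."""

    def __init__(self, spam_tokens):
        self.counts = Counter(spam_tokens)

    def train(self, msg):
        self.counts.update(set(msg))

    def untrain(self, msg):
        self.counts.subtract(set(msg))

    def score(self, msg):
        toks = set(msg)
        return sum(1 for t in toks if self.counts[t] > 0) / len(toks)


def rank_candidates(unsures, clf, spam_cutoff=0.90):
    """Return (moved, index) pairs, best candidate first.

    `moved` is how many *other* unsures go from below the cutoff to at or
    above it when the classifier is trained on that one candidate alone.
    """
    baseline = [clf.score(m) for m in unsures]
    results = []
    for i, cand in enumerate(unsures):
        clf.train(cand)
        moved = sum(
            1
            for j, other in enumerate(unsures)
            if j != i and baseline[j] < spam_cutoff <= clf.score(other)
        )
        clf.untrain(cand)  # restore state before trying the next candidate
        results.append((moved, i))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results


clf = ToyScorer(["free"])
unsures = [
    ["free", "click", "here", "now"],
    ["click", "here"],
    ["here", "now"],
    ["lottery"],
]
ranking = rank_candidates(unsures, clf)
```

`ranking[0]` is the (count, index) pair for the message that pulls the most
other unsures over the cutoff; the classifier is left in its original state
afterwards.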

--
Seth Goodman

replies to sethg [at] GoodmanAssociates [dot] com



