[Spambayes] train_on_filter

Seth Goodman delete at GoodmanAssociates.com
Wed Nov 26 12:23:36 EST 2003


> This is currently not supported in Outlook.  I'm working on a patch that
> supports train_on_filter style training as well as automatic balancing.
> Unfortunately I haven't had much time to put into it lately.
>
> --
> Kenny Pitt

That is exactly what I was thinking about.  My goal is to keep the false
negative rate down without constant fiddling with the database and periodic
retraining.  My present run is as follows: initial corpus 650 ham and 650
spam eight days ago, ham threshold = 5%, spam threshold = 90%, train on all
spam that scores less than 50% and occasionally add ham to rebalance.  I add
the spams with the lowest score first and reclassify after each spam
addition to simulate continuous training-on-errors and don't count a spam as
missed if its score subsequently goes above 5%.  I now have 808 ham and 825
spam.  The false negative rate was initially 13% and has gone down to 2-4%.
During the eight days, the classifier missed 92 out of 1363 total spams
(about 6.7% overall).
There were no false positives.  The actual number missed was higher, but
this is what I would have seen with a "continuous train-if-below-threshold"
scheme.  I will stay with the present scheme for a while to see how it goes.
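
A rough sketch of the procedure described above, to make the counting rule
concrete.  The names train_lowest_first and ToyClassifier are hypothetical and
are not part of SpamBayes; the toy scorer only stands in for a real classifier.
It assumes every spam that initially scores below the training cutoff gets
trained, lowest current score first, and that a spam counts as missed only if
it is still below the ham cutoff when its turn comes.

```python
from typing import Callable, Iterable


def train_lowest_first(
    spams: Iterable[str],
    score: Callable[[str], float],
    train: Callable[[str], None],
    ham_cutoff: float = 0.05,
    train_cutoff: float = 0.50,
) -> int:
    """Train on low-scoring spam, lowest score first; return the missed count."""
    # Candidates are the spams that score below the training cutoff right now.
    pending = [msg for msg in spams if score(msg) < train_cutoff]
    missed = 0
    while pending:
        # Rescore on every pass: the previous training step may have raised
        # the scores of the messages that are still waiting.
        pending.sort(key=score)
        msg = pending.pop(0)
        # Count a miss only if the message still scores as ham at this point.
        if score(msg) < ham_cutoff:
            missed += 1
        train(msg)
    return missed


class ToyClassifier:
    """Hypothetical stand-in: score is the fraction of tokens seen in spam."""

    def __init__(self) -> None:
        self.spam_tokens: set = set()

    def score(self, msg: str) -> float:
        tokens = msg.split()
        if not tokens:
            return 0.0
        return sum(t in self.spam_tokens for t in tokens) / len(tokens)

    def train(self, msg: str) -> None:
        self.spam_tokens.update(msg.split())


clf = ToyClassifier()
batch = ["cheap pills now", "cheap pills today", "win money"]
print(train_lowest_first(batch, clf.score, clf.train))
```

In the toy run, the second "cheap pills" message is not counted as missed,
because training on the first one lifts its score above the ham cutoff before
it is examined.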

I have been thinking about how to keep the databases to a reasonable size
and balanced, so the classifier remains agile to new spam, without getting
overly complicated.  One idea is to separately store the token set for each
message, the timestamp when those tokens were added and the message ID.
When the database reaches its maximum message count and we need to train on a
new message, we could first delete all the tokens from the oldest message in
the database.
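
A minimal sketch of that idea, assuming one such store per class (ham or spam)
and that messages are trained in timestamp order.  ExpiringTokenStore and its
methods are hypothetical names for illustration, not the SpamBayes database
API.

```python
from collections import Counter, deque
from typing import Deque, FrozenSet, Iterable, Tuple


class ExpiringTokenStore:
    """Per-class token counts, capped at a fixed number of trained messages."""

    def __init__(self, max_messages: int) -> None:
        if max_messages < 1:
            raise ValueError("max_messages must be at least 1")
        self.max_messages = max_messages
        # Oldest message first: (timestamp, message ID, token set).
        self.messages: Deque[Tuple[float, str, FrozenSet[str]]] = deque()
        # Number of stored messages that contain each token.
        self.counts: Counter = Counter()

    def train(self, message_id: str, tokens: Iterable[str], timestamp: float) -> None:
        # Make room first, so the cap is never exceeded.
        if len(self.messages) >= self.max_messages:
            self._expire_oldest()
        token_set = frozenset(tokens)
        self.messages.append((timestamp, message_id, token_set))
        self.counts.update(token_set)

    def _expire_oldest(self) -> None:
        # Undo exactly what the oldest message contributed to the counts.
        _, _, old_tokens = self.messages.popleft()
        for token in old_tokens:
            self.counts[token] -= 1
            if self.counts[token] == 0:
                del self.counts[token]


store = ExpiringTokenStore(max_messages=2)
store.train("msg-1", ["cheap", "pills"], 1.0)
store.train("msg-2", ["pills", "now"], 2.0)
store.train("msg-3", ["now"], 3.0)  # msg-1 is expired before this is added
print(dict(store.counts))
```

Keeping the token set per message costs extra storage, but it is what makes
the untraining step exact: only the counts that the expired message actually
added are removed.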

Is there any experience with this type of scheme vs. letting the database
keep growing?

Is there any experience with continuous training vs. train-on-error vs.
train-on-below-certain-score?


I am saving my complete message stream, at least for a while, in the hopes
that I can compare different strategies on the same data.  My guess is that
you folks have already done quite a lot of this.  Are there tools to take a
message corpus and run it through serially to simulate a message stream?

--
Seth Goodman

  Humans:   change "delete" to "sethg" to email me

  Spambots: disregard the above



