[spambayes-dev] Another incremental training idea...

Seth Goodman nobody at spamcop.net
Tue Jan 13 20:09:22 EST 2004


[T. Alex Popiel]
> In any case, these complete-rebuilding scenarios are predicated on
> keeping at least the token list for every single message (not just
> those that have been trained) for as long as we might want to train
> on them.  There is some indication that a significant portion of the
> userbase is unwilling to keep that much mail data lying around
> (for months, presumably)... which makes the other form of expiry
> of more practical interest.

I keep that much mail around, but I certainly agree that most people do not
like to save spam.  It's either back to expiration, then, or just keep on
training, as you suggested.

I do have a question about your incremental harness with expiry, since it's
surprising how much worse it performs as soon as it starts expiring
messages.  For classification, you obviously use the training set built from
the nonedge messages of the last 120 days.  Do you then use those same
scores to decide which of the current day's messages are nonedge?  I ask
because you would get a different set of messages to train on, and perhaps
compensate better for the particular messages you expire, if you first
expired the 120-day-old messages and then rescored the current day's
messages to determine which nonedge messages to train on.  Does this make
any sense?
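
To make the two orderings concrete, here is a minimal sketch in Python.
Everything in it is a hypothetical stand-in, not your harness and not the
spambayes API: ToyClassifier is a crude token-count scorer, is_nonedge
assumes "nonedge" means "not scored at the correct extreme" with an assumed
edge threshold of 0.005, and process_day assumes a 120-day window.  The
rescore_after_expiry flag switches between scoring before expiry and the
expire-then-rescore order I am asking about.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical token-count scorer; a stand-in, not the spambayes classifier."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def untrain(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).subtract(set(tokens))

    def score(self, tokens):
        # Average per-token spam fraction; tokens with no counts score 0.5.
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            probs.append(s / (s + h) if s + h > 0 else 0.5)
        return sum(probs) / len(probs) if probs else 0.5


def is_nonedge(score, is_spam, edge=0.005):
    # Assumed definition: a message is "edge" only when it already scores
    # at the correct extreme (spam near 1.0, ham near 0.0).
    return score < 1.0 - edge if is_spam else score > edge


def process_day(clf, trained, today, day, window=120, rescore_after_expiry=False):
    """Expire old messages and train on today's nonedge messages.

    trained: list of (day, tokens, is_spam) currently in the training set.
    today:   list of (tokens, is_spam) arriving on `day`.
    Returns the token lists that were selected for training.
    """
    if not rescore_after_expiry:
        # Score against the training set as it stood before expiry.
        scores = [clf.score(toks) for toks, _ in today]

    kept = []
    for d, toks, is_spam in trained:
        if day - d >= window:
            clf.untrain(toks, is_spam)  # message falls out of the window
        else:
            kept.append((d, toks, is_spam))
    trained[:] = kept

    if rescore_after_expiry:
        # Score against the training set after expiry has been applied.
        scores = [clf.score(toks) for toks, _ in today]

    selected = []
    for (toks, is_spam), s in zip(today, scores):
        if is_nonedge(s, is_spam):
            clf.train(toks, is_spam)
            trained.append((day, toks, is_spam))
            selected.append(toks)
    return selected


def demo(rescore):
    # One spam trained on day 0; the same spam arrives again on day 120.
    clf = ToyClassifier()
    trained = [(0, ["viagra"], True)]
    clf.train(["viagra"], True)
    return process_day(clf, trained, [(["viagra"], True)], day=120,
                       rescore_after_expiry=rescore)
```

In this toy case, demo(False) selects nothing: the new spam scores 1.0
against the old training data, is treated as edge, and then the only
supporting message is expired.  demo(True) expires first, so the new spam
scores 0.5, counts as nonedge, and is trained on.  That is the kind of
compensation I have in mind, though whether it helps on real mail is
exactly what I don't know.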

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



