[Spambayes] ageing out database entries

Kenny Pitt kennypitt at hotmail.com
Mon Nov 10 15:19:16 EST 2003


Seth Goodman wrote:
> However, if empirical evidence tells
> you to keep the database size limited, an important step would be for
> the program to do this in a reasonable way, whatever that is.  One
> other Bayesian spam program (K9) sets separate limits on the number
> of spam and ham, trains on every message and deletes the oldest
> message when a new one comes in.  This achieves whatever spam/ham
> balance you want without regard to start date.

In K9, the limits you're talking about only control how many complete
messages of each type are stored in cache for future *re-training*.
They do not affect the contents of the actual training database.  K9
does not currently do any aging of the training data, although I believe
it has been discussed in that context as well.
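
To make the distinction concrete, here is a rough sketch of that kind of
bounded re-training cache.  This is not K9's actual code; the class name
RetrainCache and its parameters are made up for illustration.  It keeps
whole messages under a separate limit per class and drops the oldest one
when a new one arrives, and it never touches any token counts.

```python
from collections import deque


class RetrainCache:
    """Hypothetical cache of whole messages kept for later re-training."""

    def __init__(self, max_spam, max_ham):
        # One bounded queue per class; the limits are independent.
        self.messages = {
            "spam": deque(maxlen=max_spam),
            "ham": deque(maxlen=max_ham),
        }

    def add(self, kind, message):
        # A deque with maxlen discards its oldest entry when it is full.
        self.messages[kind].append(message)


cache = RetrainCache(max_spam=2, max_ham=3)
for n in range(4):
    cache.add("spam", f"spam {n}")

print(list(cache.messages["spam"]))  # ['spam 2', 'spam 3']
```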

In the training database, both K9 and SpamBayes store only a list of
tokens with counts of how many times each has been seen in spam and in
ham.  No other information is stored about the original message that the
token was seen in.  The most effective way of aging out tokens would
seem to be to record the date each token was last seen and remove any
token that has not been seen in n days.  Unfortunately, this
significantly increases the size of the training database, and it also
increases the amount of work done when classifying a message (thus
decreasing performance).
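
As a minimal sketch of the last-seen idea, assuming an in-memory dict in
place of the real database: AgingTokenDB, train() and expire() are
hypothetical names, not part of SpamBayes or K9.  Each token carries its
spam count, its ham count, and the date it was last seen, and expire()
drops anything older than the threshold.

```python
from datetime import date, timedelta


class AgingTokenDB:
    """Hypothetical token store: token -> [spam_count, ham_count, last_seen]."""

    def __init__(self):
        self.tokens = {}

    def train(self, tokens, is_spam, today):
        # Assumption: count each distinct token once per message.
        for tok in set(tokens):
            rec = self.tokens.setdefault(tok, [0, 0, today])
            rec[0 if is_spam else 1] += 1
            rec[2] = today  # refresh the last-seen date

    def expire(self, today, max_age_days):
        # Remove every token not seen within the last max_age_days days.
        cutoff = today - timedelta(days=max_age_days)
        stale = [t for t, rec in self.tokens.items() if rec[2] < cutoff]
        for t in stale:
            del self.tokens[t]
        return len(stale)


db = AgingTokenDB()
db.train(["viagra", "free"], is_spam=True, today=date(2003, 8, 1))
db.train(["free", "meeting"], is_spam=False, today=date(2003, 11, 10))

removed = db.expire(today=date(2003, 11, 10), max_age_days=30)
print(removed, sorted(db.tokens))  # 1 ['free', 'meeting']
```

The extra date field on every record is the size cost mentioned above.
If "seen" is meant to include classification as well as training, then
classifying a message would also have to write to the database, which is
where the extra work per message comes from.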

-- 
Kenny Pitt



