[Spambayes] ageing out database entries

Kenny Pitt kennypitt at hotmail.com
Fri Nov 14 15:48:51 EST 2003


Seth Goodman wrote:
> [Kenny Pitt]
>> In K9, the limits you're talking about only control how many complete
>> messages of each type are stored in cache for future *re-training*.
>> They do not affect the contents of the actual training database.  K9
>> does not currently do any aging of the training data, although I
>> believe it has been discussed in that context as well.
> 
> Today I just saw the following entry in the K9 configuration
> instructions: 
> 
> "In addition to automatically cleaning up the Recent Emails list, you
> can choose to clean out the Good and Spam folders of old emails when
> they reach a certain size."

Yes, but that still only affects the token databases if you *rebuild*
them.

> It looks like K9 *does* age out old messages once the message counts
> for spam or ham are reached and a new message is available for
> training...

Yes, it does age out old messages from the cache.

> ... It appears that they train on everything, ...

Yes, it does train on everything you receive (which may or may not be a
good thing, opinions differ).  In recent versions, it also has an "Only
add reclassified emails" option in the Advanced config that trains more
like the SpamBayes Outlook addin.  If this option is on, tokens won't
get added to the training databases unless you reclassify the message,
but the actual message will still be saved in the message cache.
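
To make the difference between the two modes concrete, here is a minimal Python sketch. It is not K9's code; the function names, the `only_add_reclassified` flag, and the Counter-based databases are all assumptions made for illustration.

```python
from collections import Counter

def on_receive(tokens, scored_as_spam, spam, good, cache,
               only_add_reclassified=False):
    """Handle a newly scored message (hypothetical helper, not K9's API)."""
    # The message itself is always saved in the cache, whatever the option says.
    cache.append((list(tokens), scored_as_spam))
    # With the option off, every received message is trained as it was scored.
    if not only_add_reclassified:
        (spam if scored_as_spam else good).update(set(tokens))

def on_reclassify(tokens, as_spam, spam, good):
    """Train a message the user has explicitly reclassified."""
    (spam if as_spam else good).update(set(tokens))

spam_db, good_db, cache = Counter(), Counter(), []
on_receive(["cheap", "pills"], True, spam_db, good_db, cache,
           only_add_reclassified=True)
# At this point the message is cached but no tokens have been trained.
on_reclassify(["cheap", "pills"], True, spam_db, good_db)
```

With the option on, the databases only grow when the user corrects a message, which is roughly what I mean by training more like the Outlook addin.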

> ... and when the
> maximum number of messages in the corpus is reached, they make room
> for the new message by deleting and untraining the oldest one...

It deletes the oldest message from the corpus, but does not untrain it.
K9 does not store the original scoring of the message, and it allows you
to re-organize at any time to place messages in the correct corpus
according to the current scoring.  There is no guarantee that the corpus
a message is currently stored in matches how it was originally trained.
You can also delete the entire contents of the cache at any time without
having any effect on the scoring, until you rebuild the training
databases.
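
As a rough illustration of "deleted from the cache but not untrained", here is a small Python sketch. The class names and the Counter-based token store are hypothetical and are not taken from K9.

```python
from collections import Counter, deque

class TokenDB:
    """Toy training database: per-token message counts (assumed layout)."""
    def __init__(self):
        self.spam = Counter()
        self.good = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.good).update(set(tokens))

class MessageCache:
    """Bounded store of whole messages, kept only for later rebuilds."""
    def __init__(self, max_messages):
        self.max_messages = max_messages
        self.messages = deque()

    def add(self, text):
        self.messages.append(text)
        while len(self.messages) > self.max_messages:
            # Evict the oldest message; the token counts are left alone.
            self.messages.popleft()

db = TokenDB()
spam_cache = MessageCache(max_messages=2)
for text in ["buy pills", "cheap loans", "win money"]:
    db.train(text.split(), is_spam=True)
    spam_cache.add(text)
# "buy pills" is gone from the cache, yet db.spam still counts its tokens.
```

The counts for the evicted message only disappear if the databases are later rebuilt from whatever is still in the cache.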

> ... This
> appears to be another reason why K9 stores the two complete training
> corpuses.  This avoids having to timestamp every token, which is a
> big saving. 

The only reason K9 keeps the actual messages is so that you can rebuild
the training databases at any time.  Understanding the significance of
this requires that you go back to the early days of K9 when it used a
slightly different method of training.  Originally, K9 did not have the
ability to "untrain" a message.  Every message was added to either the
spam or good word database as soon as it was received and scored.  If
the user reclassified the message, it was repeatedly added to the
database for its new classification until its new score crossed a
certain threshold (which I believe was roughly above 80% for spam
and below 20% for good).  Obviously, this could result in some
strange training data over time, so an occasional rebuild of the
databases was useful to make sure that each message was trained in only
one of the spam or good databases.
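
The old reclassification loop can be sketched in Python roughly as follows. This is only an illustration: the scoring function is a toy per-token ratio and not K9's actual formula, the thresholds are the approximate values I recalled above, and `max_rounds` is a safety cap I added for the sketch.

```python
from collections import Counter

SPAM_THRESHOLD = 0.80  # approximate value, recalled from memory
GOOD_THRESHOLD = 0.20  # approximate value, recalled from memory

def score(tokens, spam, good):
    """Toy score: mean per-token spam ratio, 0.5 for unseen tokens."""
    ratios = []
    for tok in set(tokens):
        s, g = spam[tok], good[tok]
        ratios.append(s / (s + g) if s + g else 0.5)
    return sum(ratios) / len(ratios) if ratios else 0.5

def reclassify(tokens, as_spam, spam, good, max_rounds=100):
    """Add the message to the target database until its score crosses
    the threshold. There is no untraining, so the counts from the
    original (wrong) training stay in the other database."""
    target = spam if as_spam else good
    rounds = 0
    while rounds < max_rounds:
        current = score(tokens, spam, good)
        if as_spam and current > SPAM_THRESHOLD:
            break
        if not as_spam and current < GOOD_THRESHOLD:
            break
        target.update(set(tokens))
        rounds += 1
    return rounds

spam_db = Counter()
good_db = Counter({"hello": 1, "world": 1})  # first trained as good
rounds = reclassify(["hello", "world"], True, spam_db, good_db)
```

After the call the same message is counted several times in the spam database and still once in the good database, which is the sort of strange training data that a rebuild from the cached messages cleans up.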

-- 
Kenny Pitt



