[Spambayes] aging information

Tim Peters tim.one at comcast.net
Mon Feb 17 22:39:35 EST 2003


[D. R. Evans]
> Does spambayes have any concept that "the older information is, the
> less value it has"?

At the start, and for a long time after, the database stored a timestamp
with each token, recording the most recent time the token was actually used
during scoring.  This was intended to be the basis for "aging" algorithms,
but nobody made time to investigate those, and I believe the timestamp
fields were even removed from the database.
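
For illustration only, a per-token record with a "last used" timestamp
might look like the sketch below.  The names TokenRecord, touch and
idle_tokens are hypothetical and are not taken from the Spambayes code;
the idle-time cutoff is just one possible aging rule, not one that was
ever investigated.

```python
from dataclasses import dataclass


@dataclass
class TokenRecord:
    # Hypothetical record layout, assumed for this sketch.
    spamcount: int = 0
    hamcount: int = 0
    last_used: float = 0.0  # seconds since the epoch


def touch(db, token, now):
    """Record that `token` was just used while scoring a message."""
    rec = db.get(token)
    if rec is not None:
        rec.last_used = now


def idle_tokens(db, now, max_idle_seconds):
    """Return tokens not used in scoring for more than max_idle_seconds."""
    return [tok for tok, rec in db.items()
            if now - rec.last_used > max_idle_seconds]
```

An aging pass built on this would only identify candidates; whether
dropping them is safe is a separate question.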

So far in real life I haven't seen any need for it, and there are reasons
for caution.  A theoretical reason is that training is done by adding whole
messages, and spam probability guesses are based on that.  It's quite
unclear what happens to the mathematical underpinnings if tokens are removed
individually, instead of untraining on entire messages (i.e., the reverse of
the way training was done).  I doubt it would hurt, but intuition is a poor
guide here.
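
A minimal sketch of the distinction, assuming a toy database that keeps
per-token message counts plus the number of messages trained.  ToyDB and
its methods are hypothetical and simplified, not the Spambayes classes.

```python
from collections import Counter


class ToyDB:
    """Hypothetical token database for illustration; not Spambayes code."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spamcount = Counter()
        self.hamcount = Counter()

    def _update(self, tokens, is_spam, delta):
        counts = self.spamcount if is_spam else self.hamcount
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta
        # Each distinct token counts once per message.
        for tok in set(tokens):
            counts[tok] += delta
            if counts[tok] == 0:
                del counts[tok]

    def train(self, tokens, is_spam):
        self._update(tokens, is_spam, +1)

    def untrain(self, tokens, is_spam):
        # Exact reverse of train(): message count and token counts
        # move together.
        self._update(tokens, is_spam, -1)

    def drop_token(self, tok):
        # Removes one token but leaves nspam/nham untouched.
        self.spamcount.pop(tok, None)
        self.hamcount.pop(tok, None)


db = ToyDB()
db.train(["hey", "there", "webcam"], is_spam=True)
db.train(["hey", "there", "lunch"], is_spam=False)

# Untraining the whole spam message leaves the database exactly as if
# only the ham message had ever been trained.
db.untrain(["hey", "there", "webcam"], is_spam=True)

# Dropping a single token does not: nham is still 1, yet the remaining
# counts no longer correspond to any set of trained messages.
db.drop_token("hey")
```

After untrain() the counts are still consistent with some training
history; after drop_token() they are not, which is the case whose effect
on the math is unclear.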

A practical concern is that people fear false positives to an extraordinary
degree, and if your email is anything like mine, there are a few dozen old
acquaintances I hear from about once per year.  These are generally short
"how ya doin'?" messages, similar in that way to low-key porn spam of the form:

    Hey there! How's it going, it's Jacce...we spoke a little while back
    through the personals. I hope you remember me! Well I promised I'd
    let you know when I got my my webcam thingy up and I finally did!
    <insert porn portal URL here>

Header clues that a message "like that" came from someone I trained on as
ham two years ago remain valuable today, even though such clues have sat
idle for two years.

In real life, I'm not finding significant database growth over time simply
because I do little training anymore.  If my database size were a problem, I
expect a gross approach like purging all words with spamprobs in (.4, .6)
would give quick relief without damaging error rates more than I care about.
But that's untested, and intuition is still a poor guide <wink>.
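
The purge described above could be sketched as follows, assuming the
database can be viewed as a plain mapping from word to an
already-computed spamprob.  The function name purge_bland and the
dict-based layout are assumptions made for this example, not the
Spambayes API.

```python
def purge_bland(spamprobs, lo=0.4, hi=0.6):
    """Return a copy without words whose spamprob lies strictly
    inside the open interval (lo, hi)."""
    return {word: p for word, p in spamprobs.items()
            if not (lo < p < hi)}


# Hypothetical probabilities, made up for the example.
probs = {"webcam": 0.97, "the": 0.5, "lunch": 0.08, "edge": 0.4}
kept = purge_bland(probs)
```

Here "the" is dropped, while "edge" survives because the interval is
open at .4.  Words near .5 carry little evidence either way, which is
why removing them is expected to cost little, though as noted that
remains untested.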



