[Spambayes] ageing out database entries

Tim Peters tim.one at comcast.net
Mon Nov 10 15:11:59 EST 2003


[Seth Goodman]
> I saw mention in some other threads about methods of rebalancing spam
> and ham in the databases.  My question concerns preventing the
> databases from growing without limit.  At the moment, I have only
> about 550 ham and 750 spam, but I am still adding fifteen or twenty
> spam per day from the Unsure folder.

If you're seeing spam at the high end of the Unsure range, and you're not
seeing ham at the high end of the Unsure range, another idea is to reduce
the spam cutoff value (I use 20 and 80 as my cutoffs now).  The default
values are set closer to the endpoints to help protect new users from
initial classification glitches.
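
For the ini-file-based front ends, that's a two-line change in your
configuration file; something along these lines (the section and option
names here are from memory, so check the options documentation for your
version):

    [Categorization]
    ham_cutoff: 0.20
    spam_cutoff: 0.80

Scores below ham_cutoff get filed as ham, scores at or above spam_cutoff as
spam, and everything in between goes to Unsure.  The Outlook addin exposes
the same two numbers in its filtering options, if I remember the dialog
right.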

> Unless this slows down, after another three months or so, I will have
> around 2K spams and climbing.  Not only will this imbalance the data
> set,

Unless you also add more ham, sure.  But it should slow down.

> but I've heard that empirical tests show that too large a database
> decreases accuracy.

I saw no evidence of that when I ran mass tests, where tens of thousands of
each kind were trained on.  To the contrary, the more training, the better
the results, although a graph of performance versus training size looked
more logarithmic than anything else (IOW, you hit a point of diminishing
returns quickly).  Also, that was on a fixed set of messages.  The
characteristics of live data may well be different, since spam and ham do
change over time.  That's one argument in favor of expiring old data.  OTOH,
I have correspondents I hear from no more than once or twice per year, and
if I tossed the rare samples from them out of my ham training set, some of
them would have a hard time getting scored as ham the next time they wrote.
Lots of tradeoffs.

> So I am curious what, if any, measures SpamBayes takes to control the
> size of the databases.

Currently none.  I've been using my main addin database for about a year
now, and it's grown to 639 ham and 1049 spam.  I rarely train it anymore --
it's more than good enough as-is.  This is against an email load of about
700 new msgs per day, so in all I've trained on about 2-3 days' worth of
data over the past year.

> I saw the post that suggested ageing out databases to control total
> number plus same start date and then using random deletion to balance
> the data set sizes.  I also am uncomfortable with the idea of randomly
> deleting spam from the database, but have also noted that someone
> knowledgeable stated "intuition is a poor guide".

I don't like random deletion of trained ham because of what I said above.
That's an experiment I've tried, and I was unhappy with the results.  I'm
afraid we all get *some* kinds of ham that are so rare a statistical sampling
isn't going to find them.
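
If someone does want to play with expiry anyway, here's roughly what a
sketch could look like.  This isn't anything SpamBayes does today:  it
assumes you kept the original trained messages around along with the date
you trained them, and the load_trained_messages() bookkeeping below is
entirely made up.  The unlearn() and store() calls match the classifier
interface as I remember it, so treat the details as approximate:

    import time
    from spambayes import storage
    from spambayes.tokenizer import tokenize

    MAX_AGE = 180 * 24 * 60 * 60   # untrain anything older than ~6 months

    def load_trained_messages():
        # Hypothetical bookkeeping:  yield (message_text, is_spam,
        # train_time) for everything you've trained on.  SpamBayes doesn't
        # record this for you, so you'd have to have kept it yourself.
        return []

    bayes = storage.PickledClassifier("hammie.db")   # path is illustrative
    now = time.time()
    for msg_text, is_spam, trained_at in load_trained_messages():
        # Only expire spam here -- rare-but-legitimate ham is exactly the
        # stuff you don't want to lose (see above).
        if is_spam and now - trained_at > MAX_AGE:
            bayes.unlearn(tokenize(msg_text), is_spam)
    bayes.store()

But, again, I'd wait for a real problem before bothering.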

> As a hardware guy who does signal processing for a living, I am not at
> all surprised at this in a stochastic approach like yours.  However, if
> empirical evidence tells you to keep the database size limited, an
> important step would be for the program to do this in a reasonable way,
> whatever that is.

I suggest you wait until you have a real problem before trying to solve it.
Part of "intuition is a poor guide" is that lots of solutions turn out to be
unnecessary <wink>.

> ...
> Since I don't know jack about Python, Windows API's, Outlook, or VBA,
> I can't help with that type of programming.  If you can isolate it to
> some C modules, I *can* code in that, as long as you don't talk to me
> about Windows object classes.

SpamBayes is written 100% in Python.  Since you know C, you'd find the
Python tutorial easy going, and would be doing interesting things in the
language the same day you start learning it.
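
For a quick taste (this toy has nothing to do with the real tokenizer, it's
just to show the flavor), here's the kind of thing SpamBayes spends its days
doing, in a few lines:

    def count_words(text):
        # Map each whitespace-separated, lowercased word to how often it
        # occurs in the text.
        counts = {}
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    print(count_words("Buy now!  Buy NOW!"))
    # prints {'buy': 2, 'now!': 2}

No declarations, no memory management, and dictionaries are built in.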



