[Spambayes] Outlook plugin - training

Tim Peters tim.one@comcast.net
Thu Nov 7 21:00:21 2002


[Tim]
> It will also create a database size problem:  without a strategy for
> pruning useless words, the database will grow without bounds

[Charles Cazabon]
> Did you actually find this?

Yes.

> I found the growth tailed off dramatically after not too long.

That too -- the second derivative is negative from the start, but the first
remains positive.  "It's like" log that way, growing ever more slowly, but
inexorably.

> I no longer have the exact numbers, but database growth for
> me tailed off almost to nothing after I had trained on something like
> 1500 messages.

When I run my c.l.py test, 10 classifiers are built each training on about
30,000 msgs.  The classifier pickles are about 18MB each then.  My classifier at
work has been trained on about 1,100 msgs, and its classifier pickle is
about 2MB.  My classifier at home has been trained on about 3,000 msgs, and
its classifier pickle is about 4MB.  That last one is from memory, so when I
get home I'll make up a different number so that the three points exactly
fit a log curve <wink>.
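
One rough way to see how close the three quoted points come to a log curve
is a least-squares fit of size = a + b*ln(n).  The sketch below is only
illustrative: the name fit_log and the choice of model are assumptions, and
the 4MB figure is the from-memory estimate quoted above.

```python
import math

# (messages trained on, pickle size in MB), as quoted in the message.
# The 3,000-msg / 4MB point is the from-memory estimate.
points = [(1100, 2.0), (3000, 4.0), (30000, 18.0)]


def fit_log(points):
    """Least-squares fit of size = a + b * ln(n); returns (a, b)."""
    xs = [math.log(n) for n, _ in points]
    ys = [size for _, size in points]
    k = len(points)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = fit_log(points)

# Residual = quoted size minus fitted size; all zeros would be an exact fit.
residuals = [size - (a + b * math.log(n)) for n, size in points]

for (n, size), r in zip(points, residuals):
    fitted = a + b * math.log(n)
    print(f"{n:>6} msgs: quoted {size:5.1f} MB, fitted {fitted:5.1f} MB, "
          f"residual {r:+.2f} MB")
```

A positive b means the fitted size keeps growing with the number of
training messages, only more slowly as n increases; the residuals show how
far each quoted point sits from the fitted curve.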

Nobody has used this system long enough under a high enough daily load yet
to get frantic about database bloat, but the people who have run very large
tests must all be aware that it's inevitable (without pruning).  I've
already noticed the increase in startup time on my home box, due to loading
a bigger pickle every day.
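
For illustration only, one possible pruning strategy is to drop words that
have been seen very few times and have not shown up again for a long while.
This is a hypothetical sketch, not the Spambayes database layout or its
pruning policy: WordRecord, prune, and both thresholds are made-up names and
values.

```python
from dataclasses import dataclass


@dataclass
class WordRecord:
    # Hypothetical per-word record; not the actual Spambayes structure.
    spam_count: int
    ham_count: int
    last_seen_day: int  # day number on which the word was last seen


def prune(db, today, max_idle_days=90, max_total=1):
    """Return a new dict without words that are both rare and long idle.

    A word is dropped only if its total count is <= max_total AND it has
    not been seen for more than max_idle_days.  Both thresholds are
    arbitrary illustrative values.
    """
    return {
        word: rec
        for word, rec in db.items()
        if not (
            rec.spam_count + rec.ham_count <= max_total
            and today - rec.last_seen_day > max_idle_days
        )
    }


db = {
    "viagra": WordRecord(spam_count=40, ham_count=0, last_seen_day=199),
    "python": WordRecord(spam_count=0, ham_count=55, last_seen_day=200),
    "xq7zzt": WordRecord(spam_count=1, ham_count=0, last_seen_day=12),
}

pruned = prune(db, today=200)
print(sorted(pruned))  # the rare, long-idle "xq7zzt" is gone
```

Requiring both conditions keeps frequently seen words and recently seen
rare words, so only tokens that look like one-off noise are discarded.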
