[Spambayes] Database growing bigger

Tony Meyer tameyer at ihug.co.nz
Wed Apr 27 04:50:41 CEST 2005


> I am very happy with Spambayes' performance. I have it running in a
> number of different environments including two Linux boxes and a
> larger number of Windows machines. With the Linux boxes, I wonder
> if there's a point of diminishing returns with training - I ask
> because training is becoming quite unwieldy at this point -

If there is such a point, it will be cross-platform, since the
classification is platform-agnostic.

> when I
> go to the training window, it loads around 4-500 messages and this
> can become tedious - especially when there's a message I'm unsure 
> of; if I click it to see what's in it, the return to the 
> training page has put all the checkmarks back to where they were
> when I first opened it, and I have to go through the whole list again.

How are you going back to the review page?  If you use the browser's "back"
button, then the browser ought to display the page with all the checkmarks
as they were (some browsers are better at doing this than others).
Alternatively, you could open the 'view message' in a new window/tab, which
would work around this.

Note that you can also set the default action to take for messages (on the
Advanced Configuration page), which might make this process faster.  There's
also an option to not cache messages with the 'bulk' header.  That covers
most well-behaved mailing lists, which typically have little or no spam, so
using that option might also help.
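
To illustrate what "messages with the 'bulk' header" means, here is a rough
sketch of such a check.  It assumes the header in question is
"Precedence: bulk", and is_bulk is a made-up helper name, not a SpamBayes
function; the real option may test for more than this.

```python
from email import message_from_string


def is_bulk(raw_message):
    """Return True if the message has a 'Precedence: bulk' header.

    Assumption: 'bulk' refers to the Precedence header; this is only an
    illustration, not how SpamBayes itself decides what to cache.
    """
    msg = message_from_string(raw_message)
    # A missing header gives None, so fall back to an empty string.
    return (msg.get("Precedence") or "").strip().lower() == "bulk"
```

Messages for which this returns True would be the ones skipped by the cache,
so they would never show up on the review page.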

What sort of training are you doing?  sb_server still defaults to training
ham, and discarding spam, I think.  It would be better to do mistake-based
training, where you train only on false positives, false negatives and
unsures, and adjust the thresholds if necessary to reduce the number of
unsures (particularly spam ones).  There's lots more about this at:

<http://entrian.com/sbwiki/TrainingIdeas>
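
As a rough sketch of the mistake-based idea (not SpamBayes' own code): given
a score between 0 and 1 and the message's real class, train only when the
two disagree.  The cutoff values and all the function names here are
assumptions chosen for illustration; use whatever thresholds you have
configured.

```python
# Hypothetical sketch of mistake-based training; not SpamBayes' own code.
HAM_CUTOFF = 0.20   # assumed threshold: scores below this count as ham
SPAM_CUTOFF = 0.90  # assumed threshold: scores at or above this count as spam


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam score in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def should_train(score, is_spam, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """True only for false positives, false negatives and unsures."""
    verdict = classify(score, ham_cutoff, spam_cutoff)
    truth = "spam" if is_spam else "ham"
    # An 'unsure' verdict never equals the truth, so unsures always train.
    return verdict != truth


def select_for_training(messages):
    """messages: iterable of (msg_id, score, is_spam) tuples."""
    return [msg_id for msg_id, score, is_spam in messages
            if should_train(score, is_spam)]
```

With this approach a correctly classified message is never trained, which
keeps the database much smaller than training on everything.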

> Is there a point at which it is better to delete the database and start
> training anew? I know this is probably a hard question to answer, but,
> I wonder if you have some thoughts on this subject. 

There probably is, but I don't know when it is.  I personally start from
scratch every few months or so, but that's almost always because I'm testing
out an experimental database format and something goes wrong with it,
forcing a retrain.

AFAIK no-one has done any testing on this, although there have been tests on
'aging' a database (removing messages after a certain amount of time), which
did OK, IIRC, but not significantly better than other training techniques.
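
For what it's worth, the bookkeeping side of 'aging' could look something
like the sketch below.  It assumes you keep a record of when each message
was trained; expire_old_messages and the 120-day default are invented for
the example and are not what the tests mentioned above used.  Actually
untraining the expired messages from the classifier is left out.

```python
from datetime import datetime, timedelta


def expire_old_messages(trained, now, max_age_days=120):
    """Split trained records into (kept, expired) by age.

    trained: dict mapping msg_id -> datetime when it was trained.
    Returns the records to keep and a sorted list of expired msg_ids.
    The caller would then untrain each expired message.
    """
    cutoff = now - timedelta(days=max_age_days)
    kept = {m: t for m, t in trained.items() if t >= cutoff}
    expired = sorted(m for m, t in trained.items() if t < cutoff)
    return kept, expired
```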

Supporting different types of training is one way that I think SpamBayes
(specifically the Outlook plug-in and sb_server) could really improve.  No
time to work on that, yet, unfortunately.

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this. 


