[Spambayes] Training corrupts mbox files

Toby Dickenson tdickenson at geminidataloggers.com
Thu May 1 19:58:48 EDT 2003


On Thursday 01 May 2003 6:21 pm, Skip Montanaro wrote:
>     Toby> fwiw, I stopped using mboxtrain and its incremental mode in favor
>     Toby> of hammie, and always doing a full train on whole mailboxes. It's
>     Toby> not significantly slower.
>
> How big are your mailboxes?  I have about 12,000 hams and 7,000 spams in my
> training sets, so I generally avoid full retrains.
>
> I'm considering a somewhat different procmail-based setup for some other
> people, however, in which they would have three email addresses,
> foo at somewhere, foo+spam at somewhere and foo+ham at somewhere.  The last two
> would (obviously) be for training.  My thought was to simply have the
> training aliases append to mbox files and run mboxtrain from cron
> periodically.  I'd logrotate the training files to keep the number of files
> and their sizes to a minimum.
>
> Someone else must already be doing something like this.  Care to share?

I am using kmail with approximately 40 folders (mailboxes). I am training
directly from the kmail folders. That means I don't need duplicate copies of
emails in a separate training database, I can use the normal kmail gui for
adjusting the training sets, and training doesn't use ancient emails.

I use kmail to delete personal emails after 6 months, mailing lists after a 
few weeks, and spams after a year. That keeps the total content stable at
about 6000 hams and 800 spams.

I train overnight from cron, and a full train takes about 5 minutes. From
memory, incremental mboxtrain was taking about 4 minutes with lower cpu usage.

I have a script that generates a long hammie.py command line by parsing the 
kmail configuration file. It assumes that:
- the folder called "spam" and all its subfolders are spam training material
- "trash" and "drafts" should be ignored
- every other folder contains ham training material.
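
For anyone who wants to try something similar, here is a minimal sketch of the
folder rules above. It is not the actual script. It skips the kmail
configuration parsing entirely and assumes you already have a mapping from
folder name (slash-separated, e.g. "spam/new") to mbox path. The names
classify_folder and build_command are made up for illustration. The -g (ham)
and -s (spam) options are what I understand hammie.py to accept, so check
"hammie.py -h" on your version. Matching names case-insensitively, and also
skipping subfolders of "trash" and "drafts", are assumptions of this sketch.

```python
import shlex

# Folders that contribute no training material (assumption: their
# subfolders are skipped as well).
IGNORED = {"trash", "drafts"}


def classify_folder(name):
    """Return 'spam', 'ham', or None (ignored) for a folder name."""
    top = name.strip("/").split("/")[0].lower()
    if top == "spam":
        return "spam"   # "spam" itself and everything beneath it
    if top in IGNORED:
        return None
    return "ham"        # every other folder is ham


def build_command(folders, script="hammie.py"):
    """Build one long training command line from {folder name: mbox path}."""
    args = [script]
    for name, path in sorted(folders.items()):
        kind = classify_folder(name)
        if kind == "spam":
            args += ["-s", path]
        elif kind == "ham":
            args += ["-g", path]
    # Quote each argument so paths containing spaces survive the shell.
    return " ".join(shlex.quote(a) for a in args)
```

The resulting string can be printed and pasted into a crontab entry or a small
wrapper script, along with whatever database options your setup needs.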

I use procmail to run the hammie filter to add the headers during mail
delivery. kmail filters are used to sort incoming mail, with spam going into
a separate folder. (For a while my wife was using the same setup, but running
the hammie filter from kmail, so no procmail was needed.)

I use two folders for spam: spam/archive and spam/new. kmail filters the
spam into spam/new and marks it read. Every week I review spam/new for false
positives (I'm still waiting for my first!), then empty it into spam/archive.


Any interest in better documentation of this setup?

-- 
Toby Dickenson
http://www.geminidataloggers.com/people/tdickenson


