[Spambayes] Training corrupts mbox files
Toby Dickenson
tdickenson at geminidataloggers.com
Thu May 1 19:58:48 EDT 2003
On Thursday 01 May 2003 6:21 pm, Skip Montanaro wrote:
> Toby> fwiw, I stopped using mboxtrain and its incremental mode in favor
> Toby> of hammie, and always doing a full train on whole mailboxes. Its
> Toby> not significantly slower.
>
> How big are your mailboxes? I have about 12,000 hams and 7,000 spams in my
> training sets, so I generally avoid full retrains.
>
> I'm considering a somewhat different procmail-based setup for some other
> people, however, in which they would have three email addresses,
> foo at somewhere, foo+spam at somewhere and foo+ham at somewhere. The last two
> would (obviously) be for training. My thought was to simply have the
> training aliases append to mbox files and run mboxtrain from cron
> periodically. I'd logrotate the training files to keep the number of files
> and their sizes to a minimum.
>
> Someone else must already be doing something like this. Care to share?
I am using kmail with approximately 40 folders (mailboxes). I am training
directly from the kmail folders. That means I don't need duplicate copies of
emails in a separate training database, I can use the normal kmail GUI for
adjusting the training sets, and training doesn't use ancient emails.
I use kmail to delete personal emails after 6 months, mailing lists after a
few weeks, and spams after a year. That keeps the total content stable at
about 6000 hams and 800 spams.
I train overnight from cron, and it takes about 5 minutes. From memory,
incremental mboxtrain was taking about 4 minutes with a lower cpu usage.
I have a script that generates a long hammie.py command line by parsing the
kmail configuration file. It assumes that:
- the folder called "spam" and all its subfolders are spam training material
- "trash" and "drafts" should be ignored
- every other folder contains ham training material.
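The three rules above can be sketched roughly as follows. This is not the actual script: it skips the parsing of the kmail configuration file and simply takes a list of folder names. The names classify_folder, build_command and mail_root are hypothetical, the on-disk folder paths are a simplification, and the -g (ham) and -s (spam) options are assumed to be hammie.py's training flags, so check them against your copy before relying on this.

```python
import shlex

# Folders skipped entirely. Assumption: names are compared
# case-insensitively and subfolders of these are skipped as well.
IGNORED = {"trash", "drafts"}


def classify_folder(name):
    """Return 'spam', 'ham', or None (ignored) for a folder path like 'spam/new'."""
    top = name.lower().strip("/").split("/")[0]
    if top == "spam":
        return "spam"   # "spam" and all of its subfolders
    if top in IGNORED:
        return None     # never used for training
    return "ham"        # every other folder


def build_command(folders, mail_root="Mail"):
    """Build the hammie.py argument list for the given folder names.

    mail_root is a hypothetical base directory; -g and -s are assumed
    to be the ham and spam training options.
    """
    args = ["hammie.py"]
    for name in folders:
        kind = classify_folder(name)
        if kind is None:
            continue
        flag = "-s" if kind == "spam" else "-g"
        args += [flag, f"{mail_root}/{name}"]
    return args


if __name__ == "__main__":
    example = ["inbox", "lists/spambayes", "spam/new", "spam/archive", "trash"]
    print(shlex.join(build_command(example)))
```

With 40 or so folders the resulting command line is long, which is why it is generated rather than maintained by hand.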
I use procmail to run the hammie filter to add the headers during mail
delivery. kmail filters are then used to sort incoming mail, with spam going
into a separate folder. (For a while my wife was using the same setup, but
running the hammie filter from kmail, so no procmail was needed.)
I use two folders for spam: spam/archive and spam/new. kmail filters the
spam into spam/new and marks it read. Every week I review spam/new for false
positives (I'm still waiting for my first!), then empty it into spam/archive.
Any interest in better documentation of this setup?
--
Toby Dickenson
http://www.geminidataloggers.com/people/tdickenson