[Spambayes] training WAS: aging information

Moore, Paul Paul.Moore at atosorigin.com
Wed Feb 19 09:43:57 EST 2003


From: D. R. Evans [mailto:N7DR at arrisi.com]
> I saw a comment in the LJ article that one should train on roughly 
> equal numbers of spam and ham. Is this actually true? (This question of 
> course merely demonstrates that I'm too lazy to do the maths myself.)

That's something I'd be interested in, too - particularly as the ham:spam
ratio people get is utterly out of their control. I'm also too lazy - or
possibly incompetent - to do the maths, but IIRC, there were some
experiments done at one stage. A pointer to the relevant posts (or better
still, a summary on the website) would be very useful.

> One thing I've learned by doing the training is that approximately 10% 
> of my mail is spam. I'm surprised, because I would have guessed that 
> the proportion was lower than that. I guess that I had got to the point 
> where I mentally just filtered it out of consciousness as I clicked the 
> "delete" button every morning on the night's accumulation of the stuff.

Unfortunately for me, my mail is something like 99% *spam*. This
is because I run a highly filtered setup, with all my mailing list traffic
getting taken out of the mail stream before spambayes gets a look in.

So bad results from serious imbalances are a big problem for me. I can get
round it by pre-training on my existing inbox, but the imbalance is going
to be big one way or the other at the start.

I *really* need spambayes, not to filter out the spam, but for the other
side of the coin - to find the real mail in the mass of junk.

I regularly consider switching to a new account, but never do because
tracking down the places where my existing mail address is published
"legitimately" is just too much like hard work :-(

Paul.