[Spambayes] Spambayes repeatedly classifies messages from mailing list as SPAM despite multiple (20+) recoveries from spam folder

Meyer, Tony T.A.Meyer at massey.ac.nz
Mon Sep 8 22:20:34 EDT 2003


> I have several dozen folders in my mailbox that contain 
> different types of ham. All told, this is about 7000 
> messages, and I have about 1500 spam messages. I used these 
> as my training corpus with plug-in version 007.
> 
> Should I instead create a "sample" folder of ham that 
> contains about 1500 messages and train with that?

That's probably a good idea.  A little imbalance doesn't hurt (you could
have 2000 ham to your 1500 spam, for example), but equal numbers are best.

> What about adding a feature to the plug-in that would count 
> the number of messages in each training folder, then use a 
> random subsample of each folder (spam or ham) as necessary to 
> create a balanced training corpus?

An interesting idea.  I've opened a feature request here:
<http://sourceforge.net/tracker/index.php?func=detail&aid=802341&group_id=61702&atid=498106>
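
In the meantime, here is a rough sketch of what that subsampling might
look like.  This is not plug-in code: the function name and the
dict-of-folders layout are made up for illustration, and it assumes the
messages have already been read into plain Python lists.

```python
import random

def balanced_corpus(ham_folders, spam_folders, seed=None):
    """Pool each class, then randomly downsample the larger one.

    ham_folders / spam_folders are hypothetical dicts mapping a folder
    name to a list of messages; this is an assumed layout, not how the
    plug-in stores its training folders.
    """
    rng = random.Random(seed)
    # Flatten every folder of a class into one list.
    ham = [msg for folder in ham_folders.values() for msg in folder]
    spam = [msg for folder in spam_folders.values() for msg in folder]
    # Use as many messages per class as the smaller class can supply.
    n = min(len(ham), len(spam))
    # random.sample draws without replacement, so no message repeats.
    return rng.sample(ham, n), rng.sample(spam, n)
```

With roughly 7000 ham and 1500 spam, this would hand back 1500 of each
for training.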

We'll see what Mark has to say ;)

=Tony Meyer