[spambayes-bugs] [ spambayes-Feature Requests-802341 ] Auto-balancing of ham & spam numbers

Fri Sep 12 23:02:15 EDT 2003

Feature Requests item #802341, was opened at 2003-09-08 02:20
Message generated for change (Comment added) made by leobru
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=802341&group_id=61702

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Tony Meyer (anadelonbrin)
Summary: Auto-balancing of ham & spam numbers

Initial Comment:
From spambayes at python.org:

"""
What about adding a feature to the plug-in that would
count the number of messages in each training folder,
then use a random subsample of each folder (spam or
ham) as necessary to create a balanced training corpus?
"""

This seems like a reasonable idea (as an option), and
might work better than the experimental imbalance
adjustment, which has caused various people difficulties
(because their training sets are *very* imbalanced).  What
do you think?
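
To make the proposal concrete, here is a minimal sketch of the subsampling
step. It is not SpambBayes plug-in code: the function name
balanced_subsample and the idea of passing each training folder as a plain
list of messages are assumptions made for illustration only.

```python
import random


def balanced_subsample(ham, spam, seed=None):
    """Return (ham_sample, spam_sample), both cut to the smaller folder's size.

    `ham` and `spam` are hypothetical stand-ins for the messages in each
    training folder; any sequence of items works.
    """
    rng = random.Random(seed)          # private RNG so a seed gives repeatable picks
    ham, spam = list(ham), list(spam)
    n = min(len(ham), len(spam))       # size of the smaller folder
    # random.sample picks n distinct items without replacement.
    return rng.sample(ham, n), rng.sample(spam, n)
```

With 10 ham and 50 spam, this keeps all 10 ham and a random 10 of the spam,
so the trained corpus is balanced at the cost of ignoring 40 spam.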

----------------------------------------------------------------------

Comment By: Leonid (leobru)
Date: 2003-09-12 20:02

Message:
Logged In: YES 
user_id=790676

I don't know if it is a generally good idea or not, but I
forward everything that scores as 1.00 spam directly to
/dev/null (this way there is no way to train on it). This
effectively implements the idea "do not train on VERY spammy
spam". Works for me; about 80% of all messages (or 90% of
all spam) are immediately thrown away, and the ham/spam
numbers do not get skewed. After 3 months, not a single
non-spam mass mailing has landed in my spam box (in "unsure"
in the worst case).
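
A rough sketch of the routing described above, for anyone who wants to try
it. The function name route, the 0.20/0.90 cutoffs, and the rounding to two
decimals (to match a displayed score of "1.00") are all assumptions for
illustration, not the poster's actual setup.

```python
def route(score, ham_cutoff=0.20, spam_cutoff=0.90, discard_at=1.00):
    """Decide where a scored message goes.

    All thresholds are illustrative assumptions. Messages whose score
    rounds to `discard_at` are dropped and therefore never reach a
    folder that could be trained on.
    """
    if round(score, 2) >= discard_at:
        return "discard"               # the /dev/null case
    if score >= spam_cutoff:
        return "spam"
    if score > ham_cutoff:
        return "unsure"
    return "ham"
```

Only messages routed to "spam", "unsure", or "ham" remain available for
training, which is what keeps the spam count from running away.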

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-09-08 06:09

Message:
Logged In: YES 
user_id=14198

This isn't Outlook specific, so you can have it back :)  The
big problem I see is *which* ones to choose?  Skipping spam
may be possible, but skipping a single ham to train on could
be a huge problem.

Maybe we could train on all spam, then score all spam, then
re-train using only the least spammy spam - but I think the
answer to
http://spambayes.sourceforge.net/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x
may be relevant <wink>
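
For reference, the selection step of that "least spammy spam" idea could be
sketched as below. It covers only the picking of messages after a first
training pass; the helper name least_spammy and the score callable are
hypothetical, and whether this helps at all is untested.

```python
def least_spammy(spam_msgs, score, keep):
    """Return the `keep` spam messages with the lowest scores.

    `score` is a hypothetical callable mapping a message to its spam
    probability, as computed by a classifier already trained on all
    of the spam.
    """
    # sorted() orders ascending by score, so the least spammy come first.
    return sorted(spam_msgs, key=score)[:keep]
```

The second pass would then re-train from scratch on all ham plus only the
messages this returns, with keep set to the ham count.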

----------------------------------------------------------------------
