[spambayes-bugs] [ spambayes-Feature Requests-802341 ]
Auto-balancing of ham & spam numbers
SourceForge.net
noreply at sourceforge.net
Fri Sep 12 23:02:15 EDT 2003
Feature Requests item #802341, was opened at 2003-09-08 02:20
Message generated for change (Comment added) made by leobru
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=802341&group_id=61702
Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Tony Meyer (anadelonbrin)
Summary: Auto-balancing of ham & spam numbers
Initial Comment:
From spambayes at python.org
"""
What about adding a feature to the plug-in that would
count the number of messages in each training folder,
then use a random subsample of each folder (spam or
ham) as necessary to create a balanced training corpus?
"""
This seems like a reasonable idea (as an option), and
might work better than the experimental imbalance
adjustment, which has caused various people difficulties
(because their corpora are *very* imbalanced). What do you
think?
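
A minimal sketch of the subsampling idea quoted above, assuming each
training folder is already available as a plain Python list of messages.
The function name `balance_corpus` and its `seed` parameter are
hypothetical and are not part of SpamBayes.

```python
import random


def balance_corpus(ham, spam, seed=None):
    """Return (ham_sample, spam_sample) with equal lengths.

    The larger folder is randomly subsampled down to the size of the
    smaller one; the smaller folder is kept whole.
    """
    rng = random.Random(seed)  # a seed makes the subsample repeatable
    n = min(len(ham), len(spam))
    ham_sample = list(ham) if len(ham) == n else rng.sample(ham, n)
    spam_sample = list(spam) if len(spam) == n else rng.sample(spam, n)
    return ham_sample, spam_sample
```

With 5 ham and 50 spam, this would keep all 5 ham and pick 5 spam at
random, so most of the spam folder goes unused for training.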
----------------------------------------------------------------------
Comment By: Leonid (leobru)
Date: 2003-09-12 20:02
Message:
Logged In: YES
user_id=790676
I don't know if it is a generally good idea or not, but I
forward everything that scores as 1.00 spam directly to
/dev/null (this way there is no way to train on it). This
effectively implements the idea "do not train on VERY spammy
spam". Works for me; about 80% of all messages (or 90% of
all spam) are immediately thrown away, and the ham/spam
numbers do not get skewed. 3 months, and not a single
non-spam mass mailing in my spam box (in "unsure" in the
worst case).
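
The routing described above could be sketched roughly as follows. The
function `route` is hypothetical, and the cutoff values are assumptions
chosen for illustration; only the "discard at 1.00" rule comes from the
comment itself.

```python
def route(score, ham_cutoff=0.20, spam_cutoff=0.90, discard_at=1.00):
    """Decide what to do with a message given its spam score in [0, 1].

    Scores are compared after rounding to two decimals, so "1.00" here
    means "displays as 1.00". Cutoff defaults are assumptions.
    """
    shown = round(score, 2)
    if shown >= discard_at:
        return "discard"  # never stored, so it can never be trained on
    if shown >= spam_cutoff:
        return "spam"
    if shown <= ham_cutoff:
        return "ham"
    return "unsure"
```

Because discarded messages never reach the spam folder, only the less
obvious spam remains available for training.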
----------------------------------------------------------------------
Comment By: Mark Hammond (mhammond)
Date: 2003-09-08 06:09
Message:
Logged In: YES
user_id=14198
This isn't Outlook specific, so you can have it back :) The
big problem I see is *which* ones to choose? Skipping spam
may be possible, but skipping a single ham to train on could
be a huge problem.
Maybe we could train on all spam, then score all spam, then
re-train using only the least spammy spam - but I think the
answer to
http://spambayes.sourceforge.net/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x
may be relevant <wink>
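
The selection step of the two-pass idea above might look like this
sketch. It assumes a scoring callable is supplied by the caller;
`select_least_spammy`, `score_fn`, and `keep` are hypothetical names and
not SpamBayes APIs.

```python
def select_least_spammy(spam_msgs, score_fn, keep):
    """Return the `keep` spam messages with the lowest spam scores.

    `score_fn` maps a message to its spam score after a first training
    pass over all spam; the result is the subset to re-train on.
    """
    ranked = sorted(spam_msgs, key=score_fn)  # lowest score first
    return ranked[:keep]
```

Setting `keep` to the number of ham messages would give a balanced
corpus while skipping no ham at all.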
----------------------------------------------------------------------