[spambayes-bugs] [ spambayes-Feature Requests-802341 ]
Auto-balancing of ham & spam numbers
SourceForge.net
noreply at sourceforge.net
Mon Sep 15 22:35:36 EDT 2003
Feature Requests item #802341, was opened at 2003-09-08 05:20
Message generated for change (Comment added) made by tim_one
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=802341&group_id=61702
Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Tony Meyer (anadelonbrin)
Summary: Auto-balancing of ham & spam numbers
Initial Comment:
>From spambayes at python.org
"""
What about adding a feature to the plug-in that would
count the number of messages in each training folder,
then use a random subsample of each folder (spam or
ham) as necessary to create a balanced training corpus?
"""
This seems like a reasonable idea (as an option), and
might work better than the experimental imbalance
adjustment, which has caused various people difficulties
(because they are *very* imbalanced). What do you
think?
----------------------------------------------------------------------
>Comment By: Tim Peters (tim_one)
Date: 2003-09-15 22:35
Message:
Logged In: YES
user_id=31435
Yup, I agree it's fraught with dangers. Note that we'd also
need to remember which msgs were explicitly trained as
mistakes or unsures, to help prevent them from getting
mistreated again. For example, I have a few strange friends I
hear from maybe twice a year, and the stuff they send is so
bizarre I have to keep several years' worth of their msgs in
my ham training set (and, yes, I do think it's ham <wink>).
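A minimal sketch of the "remember explicit training" point, assuming each
message has a stable id and that the set of ids the user trained by hand
(mistakes and unsures) is stored somewhere. The function name, the `pinned`
set and the `rate` parameter are hypothetical, not part of SpamBayes.

```python
import random


def subsample_keeping_pinned(msg_ids, pinned, rate, rng=None):
    """Randomly thin msg_ids, but never drop an explicitly trained message.

    msg_ids: iterable of message ids in one training folder.
    pinned:  set of ids the user trained by hand (mistakes, unsures).
    rate:    probability of keeping any message that is not pinned.
    """
    rng = rng or random.Random()
    return [m for m in msg_ids if m in pinned or rng.random() < rate]
```

With something like this, the twice-a-year bizarre ham stays in the training
set regardless of how aggressively the rest of the folder is subsampled.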
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2003-09-15 21:56
Message:
Logged In: YES
user_id=552329
Another problem with this is that it requires the user either
to keep spam around or to store a *lot* more data. Ryan's
scheme below is really two separate things - one is aging out
old data, which has been discussed a few times, and then
randomly selecting from what's left.
I tend to agree with Mark. I think this might end up like the
experimental_ham_spam_imbalance and confuse people. Why
doesn't x get a ham score, they ask? Because it was
randomly chosen to not get included in your training data, we
answer.
The more I think about it, the more I think that (unless
someone comes up with a new, better,
experimental_ham_spam_imbalance option), the best option is
simply to warn users if they reach a certain level of
imbalance, so that their attention is drawn to the problem.
If I find the time, I might play around with setting up a test
script to train, then retrain on balanced data and see how
that goes.
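A rough sketch of the "warn on imbalance" option, assuming the only inputs
are the trained ham and spam counts. The function name and the 5.0 threshold
are made up for illustration; a sensible cutoff would have to come from
testing.

```python
def imbalance_warning(nham, nspam, max_ratio=5.0):
    """Return a warning string, or None if the counts look acceptable.

    max_ratio is a hypothetical threshold, not a tested value.
    """
    if nham == 0 or nspam == 0:
        return "Train on at least one ham and one spam message."
    ratio = max(nham, nspam) / min(nham, nspam)
    if ratio > max_ratio:
        more, less = ("ham", "spam") if nham > nspam else ("spam", "ham")
        return (f"You have trained on {ratio:.1f} times as much {more} "
                f"as {less}; results may suffer.")
    return None
```

Nothing is dropped from the training data here; the user is only told that
the numbers are lopsided.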
----------------------------------------------------------------------
Comment By: Mark Hammond (mhammond)
Date: 2003-09-15 09:16
Message:
Logged In: YES
user_id=14198
My problem is more with missing ham, and I fear that missing
a single ham could make the difference. Our low
false-positive rate is a feature we should keep :)
It all gets back to the test framework. As Tim is fond of
saying, intuition is a poor guide here.
----------------------------------------------------------------------
Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-15 08:37
Message:
Logged In: YES
user_id=731834
The last sentence under part 1) below should read "So we
choose our cutoff date to be 5/13/2003."
----------------------------------------------------------------------
Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-15 08:35
Message:
Logged In: YES
user_id=731834
Since I initially came up with this possible feature on
the mailing list, let me add my two cents. I don't think
throwing out any "super-spam" is the right approach, since
there might be some useful "almost-spam" information in
there. A spam might score 100% because it
contains 'viagra' and 'lowest' and 'price', fine, and we
already know about those tokens. But the same "super-spammy"
message might contain a new domain name, or a new
word like "silagra"; basically any other information that
is useful in the training database.
That said, I think a good algorithm might be based on
dates, to make sure the sampling is representative. I
suggest looking at the received date of the oldest message
in each corpus, and choosing the most recent of these
dates. Then we can count all messages from each corpus
that are newer than this date, and finally, take a random
subsample of the messages from the corpus which has "more"
new messages. The subsampling can be done on the fly by
using an RNG. You might get an error of a few messages in
each direction, but it won't affect the statistics
materially and will be easier to implement than keeping
track of a bunch of message-ids.
An example of my proposal:
1) Spam corpus: 1342, oldest is dated 5/13/2003; Ham
corpus: 6203, oldest is dated 6/19/2002. So we choose our
cutoff date to be 5/13/2002.
2) We already know there are 1342 messages in the spam
corpus newer than this date. We also count up 2987
messages in the ham corpus newer than this date. So we
want to choose 1342/2987=44.928% of the messages from the
ham corpus newer than 5/13/2003.
3) We tokenize and train on the whole spam corpus. Then
we start through the ham corpus, skipping all messages
older than 5/13/2003. If we come across a message newer
than that, we choose a random number between 0 and 1. If
the random number is less than 0.44928, we train with the
message. At most we should be off by a few dozen messages
from the desired 1342 trained ham.
This method gives us a balanced training set, with
representative spam and ham messages from the same time-
frame. What do you think?
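The three steps above could be sketched roughly as follows. This assumes each
corpus is available as a list of (received_date, message) pairs, which is a
made-up layout for illustration and not how SpamBayes stores its corpora. The
function name is hypothetical too, and the cutoff date is treated as
inclusive so that the oldest message of the newer corpus is kept.

```python
import random
from datetime import date


def balanced_sample(spam, ham, rng=None):
    """Return (spam_to_train, ham_to_train) from dated corpora."""
    if not spam or not ham:
        raise ValueError("both corpora must be non-empty")
    rng = rng or random.Random()

    # Step 1: the cutoff is the more recent of the two "oldest" dates.
    cutoff = max(min(d for d, _ in spam), min(d for d, _ in ham))
    recent_spam = [m for d, m in spam if d >= cutoff]
    recent_ham = [m for d, m in ham if d >= cutoff]

    # Step 2: aim for the size of the smaller post-cutoff corpus.
    target = min(len(recent_spam), len(recent_ham))

    # Step 3: keep each message with probability target/len, decided on
    # the fly with the RNG, so the final count is only approximate.
    def thin(msgs):
        if not msgs:
            return []
        rate = min(1.0, target / len(msgs))
        return [m for m in msgs if rng.random() < rate]

    return thin(recent_spam), thin(recent_ham)


# Tiny usage example with invented data.
spam = [(date(2003, 5, 13), "spam-a"), (date(2003, 7, 1), "spam-b")]
ham = [(date(2002, 6, 19), "ham-old"), (date(2003, 6, 2), "ham-a"),
       (date(2003, 6, 9), "ham-b"), (date(2003, 8, 4), "ham-c")]
spam_train, ham_train = balanced_sample(spam, ham, rng=random.Random(0))
```

The smaller corpus gets a keep rate of 1.0 and so is used whole; only the
larger one is thinned.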
Regards,
-Ryan-
----------------------------------------------------------------------
Comment By: Leonid (leobru)
Date: 2003-09-12 23:02
Message:
Logged In: YES
user_id=790676
I don't know if it is a generally good idea or not, but I
forward everything that scores as 1.00 spam directly to
/dev/null (this way there is no way to train on it). This
effectively implements the idea "do not train on VERY spammy
spam". Works for me; about 80% of all messages (or 90% of
all spam) is immediately thrown away, and the ham/spam
numbers do not get skewed. 3 months, and not a single
non-spam mass mailing in my spam box (in "unsure" in the
worst case).
----------------------------------------------------------------------
Comment By: Mark Hammond (mhammond)
Date: 2003-09-08 09:09
Message:
Logged In: YES
user_id=14198
This isn't Outlook specific, so you can have it back :) The
big problem I see is *which* ones to choose? Skipping spam
may be possible, but skipping a single ham to train on could
be a huge problem.
Maybe we could train on all spam, then score all spam, then
re-train using only the least spammy spam - but I think the
answer to
http://spambayes.sourceforge.net/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x
may be relevant <wink>
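For what it's worth, the selection step of that train/score/re-train idea
might look like the sketch below. `score` stands in for whatever returns a
spam probability from the classifier trained on all spam; it is a
hypothetical callable here, as are the function name and the `keep`
parameter. Whether this helps at all is a question for the test framework.

```python
import heapq


def least_spammy(spam_msgs, score, keep):
    """Return the `keep` spam messages with the lowest spam scores.

    score: callable mapping a message to a float in [0, 1].
    """
    return heapq.nsmallest(keep, spam_msgs, key=score)
```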
----------------------------------------------------------------------