[spambayes-bugs] [ spambayes-Feature Requests-802341 ] Auto-balancing of ham & spam numbers

Mon Sep 15 22:35:36 EDT 2003

Feature Requests item #802341, was opened at 2003-09-08 05:20
Message generated for change (Comment added) made by tim_one
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=802341&group_id=61702

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Tony Meyer (anadelonbrin)
Summary: Auto-balancing of ham & spam numbers

Initial Comment:
From spambayes at python.org

"""

What about adding a feature to the plug-in that would count the number
of messages in each training folder, then use a random subsample of
each folder (spam or ham) as necessary to create a balanced training
corpus?

"""

This seems like a reasonable idea (as an option), and might work better
than the experimental imbalance adjustment, which has caused various
people difficulties (because their training data is *very* imbalanced).
What do you think?
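
A minimal sketch of the subsampling idea quoted above, assuming each
training folder can be read as a plain Python list of messages. The
name `balance_corpora` and the `seed` parameter are made up for
illustration and are not part of SpamBayes.

```python
import random


def balance_corpora(ham, spam, seed=None):
    """Return (ham_subset, spam_subset) of equal size.

    The larger corpus is randomly subsampled down to the size of the
    smaller one; the smaller corpus is returned whole (as a copy).
    """
    rng = random.Random(seed)
    target = min(len(ham), len(spam))

    def shrink(msgs):
        if len(msgs) <= target:
            return list(msgs)
        return rng.sample(msgs, target)

    return shrink(ham), shrink(spam)


# Toy folders standing in for real training data.
ham = ["ham-%d" % i for i in range(50)]
spam = ["spam-%d" % i for i in range(10)]
ham_train, spam_train = balance_corpora(ham, spam, seed=0)
print(len(ham_train), len(spam_train))
```

Which ham gets left out is decided purely by chance here, which is
exactly the property the rest of this thread worries about.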

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2003-09-15 22:35

Message:
Logged In: YES 
user_id=31435

Yup, I agree it's fraught with dangers.  Note that we'd also need to
remember which msgs were explicitly trained as mistakes or unsures, to
help prevent them from getting mistreated again.  For example, I have a
few strange friends I hear from maybe twice a year, and the stuff they
send is so bizarre I have to keep several years' worth of their msgs in
my ham training set (and, yes, I do think it's ham <wink>).
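
One way that constraint might look in code, sketched under the
assumption that every message has an id and that the ids of explicitly
corrected messages are stored in a set. `subsample_keeping` and
`pinned_ids` are hypothetical names, not SpamBayes features.

```python
import random


def subsample_keeping(msg_ids, pinned_ids, target, seed=None):
    """Pick about `target` ids without ever dropping a pinned one.

    Pinned ids (msgs trained as mistakes or unsures) are always kept,
    even if that pushes the result above `target`; the remaining slots
    are filled by a random sample of the other msgs.
    """
    rng = random.Random(seed)
    pinned = [m for m in msg_ids if m in pinned_ids]
    others = [m for m in msg_ids if m not in pinned_ids]
    slots = max(0, target - len(pinned))
    chosen = rng.sample(others, min(slots, len(others)))
    return pinned + chosen


ham_ids = list(range(100))
corrected = {3, 41, 97}  # trained after a mistake or an unsure
kept = subsample_keeping(ham_ids, corrected, target=20, seed=1)
print(len(kept), corrected <= set(kept))
```

The cost is the extra bookkeeping: the set of corrected ids has to be
stored somewhere and survive retraining.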

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-09-15 21:56

Message:
Logged In: YES 
user_id=552329

Another problem with this is that these schemes require the user either
to keep spam around or to store a *lot* more data.  Ryan's scheme below
is really two separate things: one is aging out old data, which has
been discussed a few times, and the other is randomly selecting from
what's left.

I tend to agree with Mark.  I think this might end up like
experimental_ham_spam_imbalance and confuse people.  "Why doesn't x get
a ham score?" they ask.  "Because it was randomly chosen to not get
included in your training data," we answer.

The more I think about it, the more I think that (unless someone comes
up with a new, better experimental_ham_spam_imbalance option) the best
option is simply to warn users if they reach a certain level of
imbalance, so that their attention is drawn to the problem.

If I find the time, I might play around with setting up a test script
to train, then retrain on balanced data and see how that goes.
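
A rough sketch of the "just warn the user" option, assuming the trained
ham and spam counts are available as integers. The function name and
the `max_ratio` default of 5.0 are illustrative guesses; nothing here
has been run through the test framework.

```python
def imbalance_warning(nham, nspam, max_ratio=5.0):
    """Return a warning string if the counts are lopsided, else None.

    `max_ratio` is an arbitrary illustrative threshold, not a tested one.
    """
    if nham == 0 and nspam == 0:
        return None  # nothing trained yet, nothing to warn about
    if nham == 0 or nspam == 0:
        return ("Only one kind of message has been trained "
                "(%d ham, %d spam)." % (nham, nspam))
    ratio = max(nham, nspam) / min(nham, nspam)
    if ratio > max_ratio:
        bigger = "ham" if nham > nspam else "spam"
        return ("Training data is imbalanced: %d ham vs %d spam "
                "(%.1f times more %s)." % (nham, nspam, ratio, bigger))
    return None


print(imbalance_warning(10, 1000))
```

Unlike subsampling, this never changes what gets trained; it only
draws attention to the problem.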

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-09-15 09:16

Message:
Logged In: YES 
user_id=14198

My problem is more with missing ham, and I fear that missing a single
ham could make the difference.  Our low false-positive rate is a
feature we should keep :)

It all gets back to the test framework.  As Tim is fond of saying,
intuition is a poor guide here.

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-15 08:37

Message:
Logged In: YES 
user_id=731834

The last sentence under part 1) below should read "So we choose our
cutoff date to be 5/13/2003."

----------------------------------------------------------------------

Comment By: Ryan Malayter (rmalayter)
Date: 2003-09-15 08:35

Message:
Logged In: YES 
user_id=731834

Since I initially came up with this possible feature on the mailing
list, let me add my two cents. I don't think throwing out any
"super-spam" is the right approach, since there might be some useful
"almost-spam" information in there. A spam might score 100% because it
contains 'viagra' and 'lowest' and 'price', fine, and we already know
about those tokens. But the same "super-spammy" message might contain a
new domain name, or a new word like "silagra"; basically any other
information that is useful in the training database.

That said, I think a good algorithm might be based on dates, to make
sure the sampling is representative. I suggest looking at the received
date of the oldest message in each corpus, and choosing the most recent
of these dates. Then we can count all messages from each corpus that
are newer than this date, and finally, take a random subsample of the
messages from the corpus which has "more" new messages. The subsampling
can be done on the fly by using an RNG; you might get an error of a few
messages in each direction, but it won't affect the statistics
materially and will be easier to implement than keeping track of a
bunch of message-ids.

An example of my proposal:

1) Spam corpus: 1342, oldest is dated 5/13/2003; Ham corpus: 6203,
oldest is dated 6/19/2002. So we choose our cutoff date to be
5/13/2002.

2) We already know there are 1342 messages in the spam corpus newer
than this date. We also count up 2987 messages in the ham corpus newer
than this date. So we want to choose 1342/2987=46.324% of the messages
from the ham corpus newer than 5/13/2003.

3) We tokenize and train on the whole spam corpus. Then we start
through the ham corpus, skipping all messages older than 5/13/2003. If
we come across a message newer than that, we choose a random number
between 0 and 1. If the random number is less than 0.46324, we train
with the message. At most we should be off by a few dozen messages from
the desired 1342 trained ham.

This method gives us a balanced training set, with representative spam
and ham messages from the same time-frame. What do you think?

Regards,

   -Ryan-
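
The three steps above might be sketched roughly as follows, assuming
each corpus is a list of (received_date, message) pairs and treating
"newer than" as "on or after" the cutoff. All names are hypothetical,
the data is a toy stand-in, and the actual tokenizing and training are
left out.

```python
import random
from datetime import date, timedelta


def pick_cutoff(spam, ham):
    """Cutoff date: the more recent of the two corpora's oldest dates.

    Assumes both corpora are non-empty.
    """
    oldest_spam = min(d for d, _ in spam)
    oldest_ham = min(d for d, _ in ham)
    return max(oldest_spam, oldest_ham)


def balanced_selection(spam, ham, seed=None):
    """Return (cutoff, chosen_spam, chosen_ham).

    Messages older than the cutoff are skipped. The corpus with fewer
    recent messages is kept whole; each recent message in the other is
    kept with probability smaller_count / larger_count, decided on the
    fly, so no list of message-ids has to be stored.
    """
    rng = random.Random(seed)
    cutoff = pick_cutoff(spam, ham)
    recent_spam = [m for d, m in spam if d >= cutoff]
    recent_ham = [m for d, m in ham if d >= cutoff]

    n_spam, n_ham = len(recent_spam), len(recent_ham)
    p_spam = min(1.0, n_ham / n_spam) if n_spam else 0.0
    p_ham = min(1.0, n_spam / n_ham) if n_ham else 0.0

    chosen_spam = [m for m in recent_spam if rng.random() < p_spam]
    chosen_ham = [m for m in recent_ham if rng.random() < p_ham]
    return cutoff, chosen_spam, chosen_ham


# Toy corpora: spam starts 5/13/2003, ham starts 6/19/2002.
spam = [(date(2003, 5, 13) + timedelta(days=i % 120), "spam-%d" % i)
        for i in range(400)]
ham = [(date(2002, 6, 19) + timedelta(days=i % 450), "ham-%d" % i)
       for i in range(1800)]

cutoff, chosen_spam, chosen_ham = balanced_selection(spam, ham, seed=42)
print(cutoff, len(chosen_spam), len(chosen_ham))
```

Because the sketch computes the keep probability from the counts it is
given, it does not depend on the figures in the example: 1342/2987
works out to about 0.449, while the quoted 46.324% matches 1342/2897.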

----------------------------------------------------------------------

Comment By: Leonid (leobru)
Date: 2003-09-12 23:02

Message:
Logged In: YES 
user_id=790676

I don't know if it is a generally good idea or not, but I forward
everything that scores as 1.00 spam directly to /dev/null (this way
there is no way to train on it). This effectively implements the idea
"do not train on VERY spammy spam". Works for me; about 80% of all
messages (or 90% of all spam) are immediately thrown away, and the
ham/spam numbers do not get skewed. 3 months, and not a single non-spam
mass mailing in my spam box (in "unsure" in the worst case).

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-09-08 09:09

Message:
Logged In: YES 
user_id=14198

This isn't Outlook specific, so you can have it back :)  The big
problem I see is *which* ones to choose?  Skipping spam may be
possible, but skipping a single ham to train on could be a huge
problem.

Maybe we could train on all spam, then score all spam, then re-train
using only the least spammy spam - but I think the answer to
http://spambayes.sourceforge.net/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x
may be relevant <wink>
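
For what it's worth, the selection step of that idea is small. A sketch,
assuming `score` is any callable that maps a message to a spam
probability from a classifier already trained on all spam; both
`least_spammy` and the toy scores are invented for illustration, and
whether this helps at all is a question for the test framework.

```python
def least_spammy(spam_msgs, score, keep):
    """Return the `keep` spam messages with the lowest spam scores."""
    return sorted(spam_msgs, key=score)[:keep]


# Toy stand-in scores; a real run would ask the trained classifier.
toy_scores = {"a": 1.00, "b": 0.93, "c": 0.999, "d": 0.71, "e": 1.00}
subset = least_spammy(list(toy_scores), toy_scores.get, keep=2)
print(subset)  # ['d', 'b']
```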

----------------------------------------------------------------------
