[spambayes-dev] Another incremental training idea...

Wed Jan 21 14:52:48 EST 2004

Tony Meyer wrote:

> [Tony Meyer]
> 
>>I tried almost this with the incremental regime, using a maximum ratio
>>of 2:1 or 1:2.  It did pretty consistently worse than the
>>basic nonedge regime.  The only difference is that I didn't choose
>>which messages to use if an imbalance would be created.  The idea
>>was basically to do nonedge, except if there was an imbalance, and
>>then only train messages that move the balance closer to 1:1.
>>
> 
> [Eli Stevens]
> 
>>It sounds like you are saying that non-edge messages on the 
>>heavy side were not trained.  It seems that would be a key difference.
>>Was that the case in your test?
>>
> 
> I'm not sure what you mean by "on the heavy side".  Do you mean those that
> scored closest to the edge?  If so, then yes.  Basically, it dealt with
> messages as they arrived, one-by-one, just as an automated system would.

Sorry, I wasn't being very clear.  Hmm.  Below, isEdge( score, type ) 
can either use a fixed cutoff or one that slides based on how imbalanced 
the corpus is, but that is orthogonal to my question.

type, score = classify( msg )

if not isEdge( score, type ):
     train( msg, type )
elif more_ham_than_spam and type == spam:
     train( msg, type )
elif more_spam_than_ham and type == ham:
     train( msg, type )

# Corpus contains:
#               MoreHam Balanced MoreSpam
# MsgIsEdgeHam                   Train
# MsgIsHam      Train   Train    Train
# MsgIsSpam     Train   Train    Train
# MsgIsEdgeSpam Train

Versus:

type, score = classify( msg )

if more_ham_than_spam and type == spam:
     train( msg, type )
elif more_spam_than_ham and type == ham:
     train( msg, type )
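elif more_ham_than_spam or more_spam_than_ham:
     pass # msg is on the heavy side; skip it even if non-edge
elif not isEdge( score, type ): # corpus is balanced here
     train( msg, type )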

# Corpus contains:
#               MoreHam Balanced MoreSpam
# MsgIsEdgeHam                   Train
# MsgIsHam              Train    Train
# MsgIsSpam     Train   Train
# MsgIsEdgeSpam Train

In the first, training on non-edge takes priority over balance, while in 
the second, balance takes priority over training on non-edge.  The only 
difference is how non-edge messages that would increase imbalance are 
treated: the first trains on them, the second skips them.
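
To make the comparison concrete, here is a minimal Python sketch that 
rebuilds both tables above and lists the cells where they disagree.  It 
follows the tables rather than any real spambayes code: the function 
names, the fixed 0.05 cutoff in is_edge(), the example scores, and the 
corpus counts are all assumptions made up for illustration, and scores 
are assumed to be spam probabilities between 0 and 1.

```python
HAM, SPAM = "ham", "spam"
EDGE_CUTOFF = 0.05  # assumed fixed cutoff, not a spambayes setting


def is_edge(score, msg_type):
    """Assumed definition: score already sits at the correct extreme."""
    if msg_type == SPAM:
        return score >= 1.0 - EDGE_CUTOFF
    return score <= EDGE_CUTOFF


def train_nonedge_first(score, msg_type, n_ham, n_spam):
    """First variant: non-edge training takes priority over balance."""
    if not is_edge(score, msg_type):
        return True
    if n_ham > n_spam and msg_type == SPAM:
        return True
    if n_spam > n_ham and msg_type == HAM:
        return True
    return False


def train_balance_first(score, msg_type, n_ham, n_spam):
    """Second variant: balance takes priority over non-edge training."""
    if n_ham > n_spam:
        return msg_type == SPAM
    if n_spam > n_ham:
        return msg_type == HAM
    return not is_edge(score, msg_type)  # only reached when balanced


def decision_table(policy):
    """Map (row, column) to True/False, mirroring the tables above."""
    rows = {  # hypothetical example scores for each kind of message
        "MsgIsEdgeHam": (0.01, HAM),
        "MsgIsHam": (0.30, HAM),
        "MsgIsSpam": (0.70, SPAM),
        "MsgIsEdgeSpam": (0.99, SPAM),
    }
    cols = {  # hypothetical (ham, spam) counts already trained
        "MoreHam": (10, 5),
        "Balanced": (5, 5),
        "MoreSpam": (5, 10),
    }
    return {
        (row, col): policy(score, msg_type, n_ham, n_spam)
        for row, (score, msg_type) in rows.items()
        for col, (n_ham, n_spam) in cols.items()
    }


first = decision_table(train_nonedge_first)
second = decision_table(train_balance_first)
differences = sorted(cell for cell in first if first[cell] != second[cell])
print(differences)
# -> [('MsgIsHam', 'MoreHam'), ('MsgIsSpam', 'MoreSpam')]
```

The two cells it prints are exactly the non-edge messages on the heavy 
side; every other cell is the same in both variants.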

The second is how I interpreted your description, and is what I meant by 
non-edge messages on the heavy side not being trained (heavy side = 
message type with the most trained messages = what we are imbalanced 
towards...  It was a poor choice of words).

Does that make sense?  I'm still not sure if this is putting it clearly.  :/

Eli