[spambayes-dev] Another incremental training idea...
Eli Stevens (WG.c)
listsub at wickedgrey.com
Wed Jan 21 14:52:48 EST 2004
Tony Meyer wrote:
> [Tony Meyer]
>
>>I tried almost this with the incremental regime, using a maximum of
>>2::1 or 1::2. It did pretty consistently worse than the
>>basic nonedge regime. The only difference is that I didn't choose
>>which messages to use if an imbalance would be created. The idea
>>was basically to do nonedge, except if there was an imbalance, and
>>then only train messages that move the balance closer to 1::1.
>>
>
> [Eli Stevens]
>
>>It sounds like you are saying that non-edge messages on the
>>heavy side were not trained. It seems that would be a key difference.
>>Was that the case in your test?
>>
>
> I'm not sure what you mean by "on the heavy side". Do you mean those that
> scored closest to the edge? If so, then yes. Basically, it dealt with
> messages as they arrived, one-by-one, just as an automated system would.
Sorry, I wasn't being very clear. Hmm. Below, isEdge( score, type )
can use either a fixed cutoff or one that slides based on how imbalanced
things are, but that is orthogonal to my question.
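For concreteness, here is a rough Python sketch of what the two flavours of isEdge might look like. It is not SpamBayes code. The names is_edge_fixed and is_edge_sliding, the cutoff values, and the sliding rule itself are all made up for illustration, and it assumes scores run from 0.0 (ham end) to 1.0 (spam end), with "edge" meaning a score very close to the extreme for the message's own type.

```python
# Illustrative sketch only; not SpamBayes code.
# Assumption: scores lie in [0.0, 1.0], 0.0 is the ham end, 1.0 the spam end.


def is_edge_fixed(score, msg_type, cutoff=0.005):
    """Fixed cutoff: 'edge' means within `cutoff` of the extreme
    for the message's own type (hypothetical default value)."""
    if msg_type == "ham":
        return score <= cutoff
    return score >= 1.0 - cutoff


def is_edge_sliding(score, msg_type, n_ham, n_spam,
                    base_cutoff=0.005, max_cutoff=0.25):
    """Sliding cutoff (hypothetical rule): widen the edge band for the
    over-represented type in proportion to the imbalance, so fewer
    messages of that type count as non-edge."""
    heavy = n_ham if msg_type == "ham" else n_spam
    light = n_spam if msg_type == "ham" else n_ham
    ratio = heavy / max(light, 1)          # avoid dividing by zero
    cutoff = min(base_cutoff * max(ratio, 1.0), max_cutoff)
    return is_edge_fixed(score, msg_type, cutoff)
```

Either function could stand in for isEdge in what follows; the choice does not affect the question.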
type, score = classify( msg )
if not isEdge( score, type ):
    train( msg, type )
elif more_ham_than_spam and type == spam:
    train( msg, type )
elif more_spam_than_ham and type == ham:
    train( msg, type )
# Corpus contains:
#                  MoreHam   Balanced   MoreSpam
# MsgIsEdgeHam                          Train
# MsgIsHam         Train     Train      Train
# MsgIsSpam        Train     Train      Train
# MsgIsEdgeSpam    Train
Versus:
type, score = classify( msg )
if more_ham_than_spam and type == spam:
    train( msg, type )
elif more_spam_than_ham and type == ham:
    train( msg, type )
elif more_ham_than_spam or more_spam_than_ham:
    pass # msg would add to the imbalance, so skip it
elif not isEdge( score, type ): # corpus is balanced
    train( msg, type )
# Corpus contains:
#                  MoreHam   Balanced   MoreSpam
# MsgIsEdgeHam                          Train
# MsgIsHam                   Train      Train
# MsgIsSpam        Train     Train
# MsgIsEdgeSpam    Train
In the first, training on non-edge takes priority over balance, while in
the second, balance takes priority over training on non-edge. The only
difference is how non-edge messages that would increase imbalance are
treated.
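To check that claim mechanically, the two policies can be written as pure decision functions and compared over every combination of message kind and corpus state. This is only a sketch of the two tables above, not code from SpamBayes. The function names and the string labels ("more_ham", "balanced", "more_spam") are invented for the example, and whether a message is edge is passed in as a flag rather than computed from a score.

```python
from itertools import product


def trains_nonedge_first(is_edge, msg_type, balance):
    """First variant: non-edge messages are always trained; edge
    messages are trained only when they reduce the imbalance."""
    if not is_edge:
        return True
    if balance == "more_ham" and msg_type == "spam":
        return True
    if balance == "more_spam" and msg_type == "ham":
        return True
    return False


def trains_balance_first(is_edge, msg_type, balance):
    """Second variant: while imbalanced, train only the lighter type;
    apply the non-edge test only when the corpus is balanced."""
    if balance == "more_ham":
        return msg_type == "spam"
    if balance == "more_spam":
        return msg_type == "ham"
    return not is_edge


BALANCES = ("more_ham", "balanced", "more_spam")
ROWS = ((True, "ham"), (False, "ham"), (False, "spam"), (True, "spam"))


def decision_table(rule):
    """Map every (is_edge, msg_type, balance) case to train / don't train."""
    return {(is_edge, msg_type, balance): rule(is_edge, msg_type, balance)
            for (is_edge, msg_type), balance in product(ROWS, BALANCES)}


def differing_cases():
    """Cases where the two variants disagree."""
    first = decision_table(trains_nonedge_first)
    second = decision_table(trains_balance_first)
    return sorted(case for case in first if first[case] != second[case])
```

Calling differing_cases() lists only non-edge ham while there is more ham, and non-edge spam while there is more spam, which are exactly the non-edge messages that would increase the imbalance.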
The second is how I interpreted your description, and is what I meant by
non-edge messages on the heavy side not being trained (heavy side =
message type with the most trained messages = what we are imbalanced
towards... It was a poor choice of words).
Does that make sense? I'm still not sure if this is putting it clearly. :/
Eli