[spambayes-dev] imbalance within ham or spam training sets?

Mon Nov 3 14:09:04 EST 2003

    >> * How many woodworking messages will I need to train as ham to get
    >>   the system to properly recognize those messages as ham?  Would that
    >>   large glut of python-related messages hamper the ability of the
    >>   classifier to detect woodworking messages as ham?

    Kenny> I would think one would be sufficient, assuming of course that
    Kenny> none of the words in your woodworking message already appear in
    Kenny> your *spam* training.  SpamBayes only considers tokens that are
    Kenny> *in* the message being classified, not tokens that are *not in*
    Kenny> the message.  So, regardless of how many times a token has
    Kenny> appeared in the python messages, it will not even be considered
    Kenny> in the scoring if it does not appear in the woodworking message.
    Kenny> On the other hand, if that token *does* appear in the woodworking
    Kenny> message then it will be solidly scored as ham and therefore
    Kenny> increase the probability of the message being correctly
    Kenny> classified.

Let me rephrase the question again.  There's a discussion in Gary Robinson's
LJ article

    http://www.linuxjournal.com/article.php?sid=6467

about dealing with rare words which I didn't really follow.  If I've trained
on 1000 other ham messages and now encounter a woodworking message, some of
the words in there are likely to have not been seen before ("lathe", for
example).  Such words obviously can't contribute to scoring that message.
Let's assume I then train that message as ham.  "lathe" now has a hamcount
of 1 and a spamcount of 0.  It is a "rare word".  How many more messages
which contain "lathe" do I have to train on before it is no longer "rare"?
In particular, by training on 1000 other hams which don't contain that word,
have I somehow created an artificial barrier to getting woodworking-specific
words to have full effect as ham indicators?
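The rare-word adjustment in Robinson's article can be worked through numerically. The sketch below follows the formula given there, f(w) = (s*x + n*p(w)) / (s + n), where n is the number of trained messages containing the word and p(w) is computed from the per-class ratios. The function name and the values s=0.45 and x=0.5 are assumptions chosen for illustration, not taken from any particular configuration.

```python
def robinson_adjusted(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Adjusted spam probability f(w) for one token (illustrative helper).

    s (strength of the prior) and x (probability assumed for a word never
    seen before) are assumed values, not read from a real setup.
    """
    n = hamcount + spamcount
    if n == 0:
        return x  # never-seen word: fall back to the assumed prior
    hamratio = hamcount / nham if nham else 0.0
    spamratio = spamcount / nspam if nspam else 0.0
    if hamratio + spamratio == 0:
        return x
    p = spamratio / (hamratio + spamratio)  # raw estimate p(w)
    return (s * x + n * p) / (s + n)        # pulled toward x while n is small


# "lathe" trained as ham k times, never as spam, with 1000 other hams
# and 1000 spams already in the database.
for k in (1, 2, 5, 10):
    print(k, round(robinson_adjusted(k, 0, 1000 + k, 1000), 4))
```

Under these assumptions a single ham training already moves "lathe" from 0.5 to about 0.155, and each further one moves it closer to 0. As long as the spam count for the word is 0, p(w) is 0 no matter how large the ham total is, so in this model the 1000 other hams do not hold the word back. The ham and spam totals only enter once the word has been seen in both classes.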

If there is a problem, it might be fairly easy to fall into a trap which is
a bit difficult to get out of.  Suppose I'm starting from scratch and I know
I have several mailboxes:

    * python - 800 messages
    * cars - 100 messages
    * pop-psychology - 100 messages
    * spam - 1000 messages

As a new user, it might be very easy for me to ask SB to train on all messages
in the first three mailboxes as ham and all in the fourth as spam, thus
creating a problem (if one exists).  *If* such a problem exists (and it very
well may not), it might be better if I could tell the system to pick a
random sample of each of my collections such that the relative number of
hams and spams is about equal and so that the imbalance between mailboxes
classified as ham or spam is not too great either.
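The sampling idea above could be sketched roughly as follows. This is a hypothetical helper, not an existing SpamBayes feature: it caps each ham mailbox at the same size and then draws an equally large random sample of spam. The function name, the per_box_cap parameter, and the stand-in message lists are all assumptions made for illustration.

```python
import random


def balanced_training_sample(ham_boxes, spam_msgs, per_box_cap, seed=None):
    """Pick a training sample that is balanced across ham mailboxes and
    has roughly equal ham and spam totals (illustrative only)."""
    rng = random.Random(seed)
    ham = []
    for name in sorted(ham_boxes):
        msgs = ham_boxes[name]
        # Take at most per_box_cap messages from each ham mailbox.
        ham.extend(rng.sample(msgs, min(per_box_cap, len(msgs))))
    # Draw as many spams as hams, or all of them if there are fewer.
    spam = rng.sample(spam_msgs, min(len(ham), len(spam_msgs)))
    return ham, spam


# Stand-ins for the mailboxes listed above; real messages would go here.
boxes = {
    "python": [f"python-{i}" for i in range(800)],
    "cars": [f"cars-{i}" for i in range(100)],
    "pop-psychology": [f"pop-{i}" for i in range(100)],
}
spam_box = [f"spam-{i}" for i in range(1000)]

ham_sample, spam_sample = balanced_training_sample(
    boxes, spam_box, per_box_cap=100, seed=0
)
print(len(ham_sample), len(spam_sample))  # 300 300
```

Capping at the size of the smallest mailbox throws away most of the python messages. Whether that loss is worth it depends on whether the imbalance is a real problem in the first place.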

Skip