[spambayes-dev] imbalance within ham or spam training sets?

Mon Nov 3 12:54:01 EST 2003

We know some problems arise if grossly different numbers of ham and spam
exist in the training databases.  I wonder if there might also be problems
within a single dataset if some kinds of ham (or spam) are trained on far
more heavily than others.

That's probably not worded well.  Let me demonstrate with a concrete
example.  Suppose I've trained on exactly 1000 ham and 1000 spam, just to
eliminate that source of problems.  Within the 1000 hams, suppose I've
trained on 800 python messages, 100 messages about cars and 100 messages
about pop psychology.  We know that if I get a message about a subject I've
never trained on before (say, woodworking), it is likely to contain
topic-specific clues I've never seen ("router", "lathe", "sawdust", ...),
and those clues won't contribute to scoring the message as ham.

Questions:

    * How many woodworking messages will I need to train as ham to get the
      system to properly recognize those messages as ham?  Would that large
      glut of python-related messages hamper the ability of the classifier
      to detect woodworking messages as ham?

    * Similarly, would the 8:1 ratio of python messages to messages about
      cars or pop psychology have an effect on scoring any of those messages
      accurately?
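
One way to start poking at these questions is a toy per-token estimate.
The sketch below is not the SpamBayes classifier itself.  It assumes a
Robinson-style adjusted token probability, and the function name
token_spamprob, the constants S and X, and the token counts are all
hypothetical values chosen for illustration.

```python
# Toy per-token estimate, loosely in the style of Gary Robinson's
# adjustment.  S and X are assumed values, not taken from this message.
S = 0.45  # assumed weight given to the prior
X = 0.5   # assumed probability for a token never seen in training


def token_spamprob(ham_count, spam_count, nham, nspam, s=S, x=X):
    """Estimate a token's spam probability from per-message counts."""
    n = ham_count + spam_count
    if n == 0:
        return x  # unseen token: neutral, gives no evidence either way
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    p = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward x when there is little evidence.
    return (s * x + n * p) / (s + n)


NHAM = NSPAM = 1000
for trained in (0, 1, 2, 5, 10):
    # Assume "lathe" appears in every woodworking ham trained so far
    # and never in spam.
    prob = token_spamprob(trained, 0, NHAM + trained, NSPAM)
    print(f"{trained:2d} woodworking hams -> spamprob('lathe') = {prob:.3f}")
```

In this simplified model the 800 python messages affect a woodworking
token only through the total ham count, so it says nothing about how the
combined message score behaves.  That part still needs testing on real
data.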

Skip