[Spambayes] Effects of training set size

T. Alexander Popiel popiel@wolfskeep.com
Fri, 04 Oct 2002 15:15:17 -0700


Executive summary: Increasing the training set size helps, but
not as much as one might think.  Specifically, the ham/spam
means spread apart, but the error rates stay fairly constant.
More data improves classification of ham, but it seems that
a _very_ small sample of spam (200 messages) is enough to
represent it.


I'm running with everything at defaults, which means I'm using
the Robinson classifier, spam_cutoff of 0.560, x = 0.5, s = 0.45,
et cetera, et cetera, ad nauseam.  I have about 3000 spam and
nearly 2000 ham, representing everything from my own personal
mail feed since 22 Aug 2002 (when I stopped throwing away a
significant portion of my ham).  I should have a full 2000 ham
in another day or two, at which point I'll probably redo my
data directories.

I did cross-validation (via timcv.py) using --ham-keep and
--spam-keep at each of 50, 70, 90, 110, 130, 150, 170, and 190.
With 5 sets, each fold trains on 4 sets and tests on the fifth, so
the training corpus sizes were 200, 280, 360, 440, 520, 600, 680,
and 760 each of ham and spam, tested against the 50 to 190 held-out
messages of each kind.
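
For anyone checking the arithmetic, here is a minimal sketch of where
those sizes come from.  It assumes the 5 sets configured in the script,
and that each fold trains on all but one set and tests on the one held
out; fold_sizes is a name made up for this illustration, not part of
timcv.py.

```python
# Sketch: per-class training/testing sizes for n-fold cross-validation.
# Assumption: each fold trains on (sets - 1) sets and tests on the other one.
SETS = 5
KEEPS = [50, 70, 90, 110, 130, 150, 170, 190]


def fold_sizes(keep, sets=SETS):
    """Return (train, test) message counts per class for one fold."""
    return (sets - 1) * keep, keep


sizes = [fold_sizes(k) for k in KEEPS]
for keep, (train, test) in zip(KEEPS, sizes):
    print(f"keep={keep:3d}  train={train:3d}  test={test:3d}")
```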

I used the following adaptation of runtest.sh:

"""
#! /bin/sh -x
##
## runsizes.sh -- run some tests for Tim
##
## This does everything you need to test yer data.  You may want to skip
## the rebal steps if you've recently moved some of your messages
## (because they were in the wrong corpus) or you may suffer my fate and
## get stuck forever re-categorizing email.
##
## Just set up your messages as detailed in README.txt; put them all in
## the reservoir directories, and this script will take care of the
## rest.  Paste the output (also in the results*.txt files) to the mailing list for
## good karma.
##
## Neale Pickett <neale@woozle.org>
##

if [ "$1" = "-r" ]; then
    REBAL=1
    shift
fi

# Number of messages per rebalanced set
RNUM=190

# Number of sets
SETS=5

# Seed for random number generator
SEED=13666

if [ -n "$REBAL" ]; then
    # Put them all into reservoirs
    python2.2 rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n 0 -Q
    python2.2 rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n 0 -Q
    # Rebalance
    python2.2 rebal.py -r Data/Ham/reservoir -s Data/Ham/Set -n $RNUM -Q
    python2.2 rebal.py -r Data/Spam/reservoir -s Data/Spam/Set -n $RNUM -Q
fi

for keep in 50 70 90 110 130 150 170 190; do
    python2.2 timcv.py -n $SETS --ham-keep $keep --spam-keep $keep -s $SEED > run$keep.txt
done

for k1 in 50 70 90 110 130 150 170; do
    k2=$((k1 + 20))
    python2.2 rates.py run$k1 run$k2 > runrates$k1.txt
    python2.2 cmp.py run${k1}s run${k2}s | tee results$k1.txt
done

for k1 in 50 70 90 110 130 150 170; do
    k2=190
    python2.2 rates.py run$k1 run$k2 > runrates${k1}-190.txt
    python2.2 cmp.py run${k1}s run${k2}s | tee results${k1}-190.txt
done
"""

I then hand-munged the results output to reveal:

keep:     50      70      90     110     130     150     170     190
fp %:           (meaningless, only 1 or 2 fp in any run)
fn %:    3.20    4.57    4.00    4.36    4.15    3.20    3.53    4.53
h mean: 25.28   24.38   22.19   21.35   21.21   20.91   20.37   19.50
h sdev:  7.45    7.56    6.86    6.89    7.05    6.92    6.87    6.81
s mean: 74.21   74.54   73.65   73.92   74.63   74.99   74.81   74.52
s sdev:  8.56    9.10    8.84    9.13    8.98    8.76    8.62    8.99
mean difference (s mean - h mean):
        48.93   50.16   51.46   52.57   53.42   54.08   54.44   55.02

I'm not sure if the fn % are significant, and they're jumping
enough for me to suspect they're not.  No obvious trend there,
anyway.
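
One rough way to judge whether those fn % swings mean anything is a
binomial standard error on each rate.  The sketch below is only an
illustration, not something from the test run itself: it assumes each
run scored SETS * keep spam in total (every spam tested once across the
5 folds) and treats the messages as independent trials.

```python
import math

# Figures copied from the table; SETS is the value used in the script.
SETS = 5
KEEPS = [50, 70, 90, 110, 130, 150, 170, 190]
FN_PCT = [3.20, 4.57, 4.00, 4.36, 4.15, 3.20, 3.53, 4.53]


def fn_std_error(pct, n):
    """Binomial standard error, in percentage points, of a rate over n trials."""
    p = pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Assumption: total spam scored per run is SETS * keep.
errors = [fn_std_error(p, SETS * k) for p, k in zip(FN_PCT, KEEPS)]
for keep, pct, err in zip(KEEPS, FN_PCT, errors):
    print(f"keep={keep:3d}  fn={pct:.2f}%  +/- {err:.2f}")
```

Under those assumptions the standard errors are on the order of half a
point to a point, which is about the same size as the run-to-run
differences.  That fits the suspicion that the jumps are noise.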

The ham mean drifted down steadily with more data, and the spam
mean held fairly constant with a very slight upward drift.
Ham sdev seems to get slowly tighter, with spam sdev jiggling
in no particularly obvious direction.
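
To put rough numbers on those drifts, a least-squares slope of each row
against the keep value can be computed as sketched below.  The straight
line fit is purely illustrative (nothing says the trend is linear), and
the slope helper is written out by hand for this sketch.

```python
# Figures copied from the table.
KEEPS = [50, 70, 90, 110, 130, 150, 170, 190]
ROWS = {
    "h mean": [25.28, 24.38, 22.19, 21.35, 21.21, 20.91, 20.37, 19.50],
    "h sdev": [7.45, 7.56, 6.86, 6.89, 7.05, 6.92, 6.87, 6.81],
    "s mean": [74.21, 74.54, 73.65, 73.92, 74.63, 74.99, 74.81, 74.52],
    "s sdev": [8.56, 9.10, 8.84, 9.13, 8.98, 8.76, 8.62, 8.99],
}


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


slopes = {name: slope(KEEPS, ys) for name, ys in ROWS.items()}
for name, b in slopes.items():
    # Scaled to "points per 100 extra messages kept per set" for readability.
    print(f"{name}: {100 * b:+.2f} per 100 messages kept")
```

On these figures the ham mean slope is clearly negative and the spam
mean slope is positive but much smaller in magnitude.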

Finally, the difference in means steadily increased, echoing the
downward drift of the ham mean.
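
The split between the two means can be checked directly from the table,
as in this small sketch.  It only re-derives the difference row and
compares the first and last columns; the variable names are made up for
the illustration.

```python
# Figures copied from the table.
H_MEAN = [25.28, 24.38, 22.19, 21.35, 21.21, 20.91, 20.37, 19.50]
S_MEAN = [74.21, 74.54, 73.65, 73.92, 74.63, 74.99, 74.81, 74.52]

# Spam mean minus ham mean for each keep value, rounded like the table.
diffs = [round(s - h, 2) for h, s in zip(H_MEAN, S_MEAN)]
print("mean difference:", diffs)

# How much of the widening from keep=50 to keep=190 comes from each side.
ham_drop = round(H_MEAN[0] - H_MEAN[-1], 2)
spam_rise = round(S_MEAN[-1] - S_MEAN[0], 2)
widening = round(diffs[-1] - diffs[0], 2)
print(f"widening {widening} = ham drop {ham_drop} + spam rise {spam_rise}")
```

Between the smallest and largest runs, nearly all of the widening comes
from the ham side.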

All of the reports are available at:
  http://www.wolfskeep.com/~popiel/spambayes/trainsize


My next experiment: try this all again with --ham-keep constant
and only --spam-keep variable. :-)

- Alex