[Spambayes] Effects of ham to spam ratio

Mon, 07 Oct 2002 12:58:51 -0700

Executive summary: more spam is VERY good.  1:4 ham:spam gives a
_much_ lower false negative rate (1.20%) than 4:1 ham:spam
(14.40%), or even 1:1 ham:spam (4.48%).

I'm back with another unusual experiment.  This time, I varied
the ratio of ham to spam, while keeping the total number of
messages trained and tested constant.  Once again, I'm doing
this using the all-defaults Robinson classifier.  If someone
gives me a good set of .ini files, I'd be more than happy to
run this test using any of the central limit algorithms, too.

I again used timcv.py as my test driver, this time with 200
messages in each ham/spam set.  For the different runs, I used
the --{ham,spam}-keep options to control how much of each set
got used, with the total used always being 250 ham+spam from
each pair.  The script I used (along with all the run output,
etc.) is on my website at:

  http://www.wolfskeep.com/~popiel/spambayes/ratio
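
The keep values for the seven runs follow a simple pattern: step the
ham count up by 25 and give the rest of the 250 to spam.  A minimal
sketch of that schedule, not the script from the website; the function
name, the 25-message step and the lower bound of 50 are assumptions
read off the table header further down:

```python
def keep_schedule(total=250, lowest=50, step=25):
    """Return (ham_keep, spam_keep) pairs whose sum is always `total`.

    Hypothetical helper: `lowest` and `step` are assumptions taken
    from the ratios reported in the results table.
    """
    pairs = []
    ham = lowest
    while total - ham >= lowest:
        pairs.append((ham, total - ham))
        ham += step
    return pairs


for ham_keep, spam_keep in keep_schedule():
    # Only the two keep options named in the text are shown; any other
    # arguments timcv.py needs are deliberately left out.
    print(f"--ham-keep {ham_keep} --spam-keep {spam_keep}")
```

This yields seven pairs, from 50-200 through 200-50, each adding up to
250 messages per ham/spam pair.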

I also mangled a version of cmp.py (now called table.py,
also on the website) to generate the following output:

-> <stat> tested 50 hams & 200 spams against 200 hams & 800 spams
[... edited for brevity ...]
-> <stat> tested 200 hams & 50 spams against 800 hams & 200 spams

ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
fp tot:          2       1       2       2       3       3       1
fp %:         0.80    0.27    0.40    0.32    0.40    0.34    0.10
fn tot:         12      17      20      28      28      30      36
fn %:         1.20    1.94    2.67    4.48    5.60    8.00   14.40
h mean:      28.80   25.01   22.57   20.83   19.80   18.74   16.59
h sdev:       8.37    7.61    7.09    7.07    7.24    7.24    7.30
s mean:      78.32   76.48   75.05   73.79   72.88   70.96   68.10
s sdev:       7.87    8.36    8.82    9.28    9.77   10.36   10.86
mean diff:   49.52   51.47   52.48   52.96   53.08   52.22   51.51
k:            3.05    3.22    3.30    3.24    3.12    2.97    2.84
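
The percentage rows can be reproduced from the totals if each run is
taken to be a five-fold cross-validation, so that every message is
tested exactly once.  The fold count of 5 is an assumption inferred
from the "tested 50 hams & 200 spams against 200 hams & 800 spams"
line, and the helper name is made up for this sketch:

```python
FOLDS = 5  # assumption: one test set of 250 against four training sets

# (ham per set, spam per set, fp total, fn total), copied from the table
runs = [
    (50, 200, 2, 12),
    (75, 175, 1, 17),
    (100, 150, 2, 20),
    (125, 125, 2, 28),
    (150, 100, 3, 28),
    (175, 75, 3, 30),
    (200, 50, 1, 36),
]


def error_rates(ham_per_set, spam_per_set, fp_total, fn_total, folds=FOLDS):
    """Return (fp %, fn %) relative to the hams and spams actually tested."""
    hams_tested = ham_per_set * folds
    spams_tested = spam_per_set * folds
    return 100.0 * fp_total / hams_tested, 100.0 * fn_total / spams_tested


for ham, spam, fp, fn in runs:
    fp_pct, fn_pct = error_rates(ham, spam, fp, fn)
    print(f"{ham:>3}-{spam:<3}  fp {fp_pct:5.2f}%  fn {fn_pct:5.2f}%")
```

Note that the denominators differ per column: the fp rate is taken
over hams tested and the fn rate over spams tested, so equal totals
(28 fn at both 125-125 and 150-100) give different percentages.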

There are several interesting things here:

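1. The false positive rate remains insignificant throughout: at
   most 3 messages per run, never more than 0.80%.
2. The false negative rate drops significantly as the ham:spam
   ratio goes down, from 14.40% at 4:1 to 1.20% at 1:4.  The more
   spam you have in your mail feed, the better this whole thing
   works.
3. The ham:spam ratio affects the spam sdev much more than the
   ham sdev: spam sdev climbs from 7.87 to 10.86 as the share of
   ham grows, while ham sdev stays between 7.07 and 8.37.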
4. Tim's k value (mean separation divided by sum of standard
   deviations) is best with slightly less ham than spam (3.30 at
   2:3), which happens to be about the same ratio as in my real
   mail feed.
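
The k row can be recomputed from the mean and sdev rows using the
definition in point 4.  A minimal sketch; the function name is
hypothetical, and the figures are copied from the table above:

```python
def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Mean separation divided by the sum of the standard deviations."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# ham-spam label -> (h mean, h sdev, s mean, s sdev)
table = {
    "50-200": (28.80, 8.37, 78.32, 7.87),
    "75-175": (25.01, 7.61, 76.48, 8.36),
    "100-150": (22.57, 7.09, 75.05, 8.82),
    "125-125": (20.83, 7.07, 73.79, 9.28),
    "150-100": (19.80, 7.24, 72.88, 9.77),
    "175-75": (18.74, 7.24, 70.96, 10.36),
    "200-50": (16.59, 7.30, 68.10, 10.86),
}

for label, stats in table.items():
    print(f"{label:>7}  k = {separation_k(*stats):.2f}")

# The column with the largest k is the 2:3 ham:spam run.
best = max(table, key=lambda label: separation_k(*table[label]))
print("largest k at", best)
```

The mean difference alone peaks at 150-100, but the growing spam sdev
pulls k down there, which is why the maximum sits at 100-150.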

It would be very interesting to find out if the best ham:spam
ratio for k (#4 above) is constant, or if it's actually tied to
the ratio in the real mail feed from which the training data is
taken.  This may be hard to measure for people who are using
corpora augmented from several sources.

- Alex