[spambayes-dev] Another incremental training idea...

Tim Peters tim.one at comcast.net
Thu Jan 15 11:48:25 EST 2004


[Toby Dickenson]
> I've finally got the cross validation tools working here, and the
> first thing I looked at was imbalance. My normal training set is
> currently 14k hams and 2k spams. This test compared that imbalance
> against three independently selected balanced sets with 2k of both.

Well, you're varying both balance and total number of messages in these
tests, so it's hard to pin down which hypothesis they're really testing.
To test balance alone, given that you've got no more than 2K spam, tests
of, e.g., 1900:100, 1800:200, 1700:300, ..., 300:1700, 200:1800, 100:1900
would vary balance while keeping the total # of messages fixed.  The cv
tester's --ham-keep and --spam-keep options can be used to automatically
pick random subsets of given sizes, btw, without needing to rearrange your
data files.
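
For illustration, here's a rough sketch in plain Python -- not the tester's
actual code, and the message lists here are made up -- of what that amounts
to:  random fixed-size subsets, with only the ham:spam ratio changing while
the total stays put:

    import random

    def keep_subset(msgs, n):
        # Roughly what --ham-keep n / --spam-keep n amount to:  use a
        # random subset of size n instead of every message on disk.
        return random.sample(msgs, n)

    # Sweeping the ratio while holding the total fixed at 2000 messages:
    for n_ham in range(1900, 99, -100):
        n_spam = 2000 - n_ham
        print("%d ham : %d spam" % (n_ham, n_spam))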

> If I'm reading this right, my 7:1 imbalance doesn't hurt me.
>
> filename:    unbal    bal1    bal2    bal3
> ham:spam:  14560:1992  1992:1992  1992:1992  1992:1992
> ...
> fp %:         0.00    0.00    0.05    0.00
> ..
> fn %:         0.60    0.30    0.40    0.30
> ...
> unsure %:     0.62    0.53    0.58    0.73

Whatever this is really testing, the FN *percentage* is worst in the first
column, and the Unsure percentage isn't winning there <wink>.  Since you
kept the total # of spam fixed across all 4 tests, and FN are a subset of
spam, a decrease in FN percentage is also a decrease in FN absolute count.
IOW, your results show that if you had trained on less ham, you would have
gotten fewer false negatives (half to two-thirds of the number you got in
the first column), despite training on some 12,500 fewer ham in the balanced
columns:

fn total:       12       6       8       6

Those columns are all "out of 1992"; no ham *can* be a FN.
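
To make the arithmetic explicit (numbers straight from the fn % row):

    n_spam = 1992
    fn_pct = [0.60, 0.30, 0.40, 0.30]       # the fn % row above
    print([round(p / 100.0 * n_spam) for p in fn_pct])
    # -> [12, 6, 8, 6]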

> real cost:  $32.40  $10.20  $22.60  $11.80
> best cost:  $27.60   $7.00   $9.80   $8.60

Those two are just misleading when the total # of msgs changes across runs.
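
Here's the rough shape of why, assuming the usual $10-per-FP, $1-per-FN,
$0.20-per-unsure weighting (adjust if your cost options differ):  identical
rates produce a bigger dollar figure simply because more messages get
scored:

    def dollar_cost(n_ham, n_spam, fp_pct, fn_pct, unsure_pct,
                    fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
        # fp % is per ham, fn % is per spam, unsure % is per message scored.
        fp = fp_pct / 100.0 * n_ham
        fn = fn_pct / 100.0 * n_spam
        unsure = unsure_pct / 100.0 * (n_ham + n_spam)
        return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

    # Identical rates, different totals:
    print(dollar_cost(14560, 1992, 0.00, 0.60, 0.62))   # ~32.5
    print(dollar_cost( 1992, 1992, 0.00, 0.60, 0.62))   # ~16.9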

> h mean:       0.11    0.23    0.30    0.32
> h sdev:       1.89    2.47    3.46    3.26
> s mean:      96.93   99.06   99.04   99.02
> s sdev:      12.11    6.88    6.98    7.21
> mean diff:   96.82   98.83   98.74   98.70
> k:            6.92   10.57    9.46    9.43

The first column shows a much fuzzier idea of what spam is (spam sdev is
much larger than in the other columns), and k is much smaller -- k is the
number such that hmean + k*hsdev == smean - k*ssdev, and is a measure of
population separation.
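
Rearranging, k = (smean - hmean) / (hsdev + ssdev), and plugging the numbers
above into that reproduces the k row:

    # (hmean, hsdev, smean, ssdev) per column:  unbal, bal1, bal2, bal3
    stats = [(0.11, 1.89, 96.93, 12.11),
             (0.23, 2.47, 99.06,  6.88),
             (0.30, 3.46, 99.04,  6.98),
             (0.32, 3.26, 99.02,  7.21)]
    for hmean, hsdev, smean, ssdev in stats:
        print(round((smean - hmean) / (hsdev + ssdev), 2))
    # -> 6.92, 10.57, 9.46, 9.43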

Picture the limit:  you train on all ham and no spam.  Then no message can
get classified as spam (no token "looks spammy").  You'll get lots of FN,
and at best a spam will score as Unsure.  The classifier's idea of spam is
extremely fuzzy.  The good news is that you'll get no FP.  Add 1 spam to the
training data, and the situation improves, but probably not by a whole lot.
Etc.  It's possible that best results will be achieved at some (non-insane)
ratio other than 1:1, but almost certain that, if so, the best ratio will
vary with the specific email mix.

I *think* you're doing better at 7:1 than most people would.  An FN rate of
0.6% is low enough that I wouldn't bother to change anything in my personal
classifier.  For a truly high-volume application, though, the difference
between 0.6% and 0.4% actually is the 50% it looks like it is <wink>.



