[Spambayes] varying proportion of spam:ham testing.

Anthony Baxter anthony@interlink.com.au
Mon, 30 Sep 2002 18:33:07 +1000


Ok, as promised last week, here are my results from
varying the proportion of spam:ham in my personal spam/ham
corpus (note that this is not the monster corpus, but a new
one from my personal email).

It's currently set up as a 4-way split, with 210 spam
and 2500 ham in each Set.

I ran timcv four times, each with --spam=200 and one of
--ham=200
--ham=400
--ham=1000
--ham=2000
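
As a quick check on the arithmetic, the totals and ham:spam ratios
implied by these options over the 4 Sets can be sketched as below.
This is only an illustration: run_totals is a made-up helper, not
part of timcv, and it assumes each message is scored exactly once
across the folds (which matches the item counts in the output).

```python
from fractions import Fraction

N_SETS = 4          # the 4-way split described above
SPAM_PER_SET = 200  # --spam=200


def run_totals(ham_per_set, spam_per_set=SPAM_PER_SET, n_sets=N_SETS):
    """Return (total ham, total spam, ham:spam ratio) for one run.

    Hypothetical helper for illustration only.
    """
    ham = ham_per_set * n_sets
    spam = spam_per_set * n_sets
    return ham, spam, Fraction(ham, spam)


for ham_per_set in (200, 400, 1000, 2000):
    ham, spam, ratio = run_totals(ham_per_set)
    print(f"--ham={ham_per_set}: {ham} ham, {spam} spam, ratio {ratio}:1")
```

The four ratios come out as 1:1, 2:1, 5:1 and 10:1, which is
where the names of the result files come from.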

Standard settings, aside from
spam_cutoff: 0.58
mine_received_headers: True

count_all_header_lines, as reported last week, makes the results
considerably worse, so it is left off.


==> ant_1:1s.txt <==
-> <stat> Ham scores for all runs: 800 items; mean 24.58; sdev 10.16
-> <stat> Spam scores for all runs: 800 items; mean 80.11; sdev 7.99
-> best cutoff for all runs: 0.575
->     with weighted total 1*3 fp + 7 fn = 10
->     fp rate 0.375%  fn rate 0.875%
total unique false pos 3
total unique false neg 8
average fp % 0.375
average fn % 1.0

==> ant_2:1s.txt <==
-> <stat> Ham scores for all runs: 1600 items; mean 23.18; sdev 9.17
-> <stat> Spam scores for all runs: 800 items; mean 79.29; sdev 8.58
-> best cutoff for all runs: 0.55
->     with weighted total 1*2 fp + 8 fn = 10
->     fp rate 0.125%  fn rate 1%
total unique false pos 1
total unique false neg 15
average fp % 0.0625
average fn % 1.875

==> ant_5:1s.txt <==
-> <stat> Ham scores for all runs: 4000 items; mean 21.02; sdev 8.44
-> <stat> Spam scores for all runs: 800 items; mean 77.80; sdev 9.45
-> best cutoff for all runs: 0.525
->     with weighted total 1*8 fp + 10 fn = 18
->     fp rate 0.2%  fn rate 1.25%
total unique false pos 2
total unique false neg 23
average fp % 0.05
average fn % 2.875

==> ant_10:1s.txt <==
-> <stat> Ham scores for all runs: 8000 items; mean 19.46; sdev 7.89
-> <stat> Spam scores for all runs: 800 items; mean 76.09; sdev 10.45
-> best cutoff for all runs: 0.5
->     with weighted total 1*8 fp + 11 fn = 19
->     fp rate 0.1%  fn rate 1.38%
total unique false pos 2
total unique false neg 53
average fp % 0.025
average fn % 6.625
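
The "best cutoff" lines appear to be plain counts divided by the
number of ham or spam scored, with the "1*" being the weight put
on a false positive. A small sketch that recomputes them from the
counts above (RUNS and rates are illustrative names, not anything
from the test driver):

```python
# (run, ham scored, spam scored, fp, fn) at each run's reported best
# cutoff, copied from the output above.
RUNS = [
    ("1:1", 800, 800, 3, 7),
    ("2:1", 1600, 800, 2, 8),
    ("5:1", 4000, 800, 8, 10),
    ("10:1", 8000, 800, 8, 11),
]

FP_WEIGHT = 1  # assumed meaning of the "1*" in the "weighted total" lines


def rates(n_ham, n_spam, fp, fn, fp_weight=FP_WEIGHT):
    """Return (weighted total, fp rate in %, fn rate in %)."""
    return fp_weight * fp + fn, 100.0 * fp / n_ham, 100.0 * fn / n_spam


for name, n_ham, n_spam, fp, fn in RUNS:
    total, fp_pct, fn_pct = rates(n_ham, n_spam, fp, fn)
    print(f"{name}: weighted total {total}, fp {fp_pct:.3g}%, fn {fn_pct:.3g}%")
```

Note these are the rates at each run's own best cutoff; the
"average fp %" and "average fn %" lines are for the fixed
spam_cutoff of 0.58.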

So increasing the proportion of ham:spam drags down the ham mean and
std dev, and also drags down the spam mean, while the spam std dev
goes up.
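
The direction of each statistic can be checked mechanically from
the <stat> lines. A minimal sketch, with the numbers copied from
the output above in order of increasing ham (1:1, 2:1, 5:1, 10:1);
the helper name trend is made up for illustration:

```python
# Means and std devs from the <stat> lines, ordered 1:1, 2:1, 5:1, 10:1.
columns = {
    "ham mean": [24.58, 23.18, 21.02, 19.46],
    "ham sdev": [10.16, 9.17, 8.44, 7.89],
    "spam mean": [80.11, 79.29, 77.80, 76.09],
    "spam sdev": [7.99, 8.58, 9.45, 10.45],
}


def trend(values):
    """Return 'down' or 'up' for a strictly monotonic list, else 'mixed'."""
    pairs = list(zip(values, values[1:]))
    if all(a > b for a, b in pairs):
        return "down"
    if all(a < b for a, b in pairs):
        return "up"
    return "mixed"


for name, values in columns.items():
    print(f"{name}: {trend(values)}")
```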

Anthony