[Spambayes] varying proportion of spam:ham testing.
Anthony Baxter
anthony@interlink.com.au
Mon, 30 Sep 2002 18:33:07 +1000
Ok, as promised last week, here are my results from
varying the proportion of spam:ham in my personal spam/ham
corpus (note that this is not the monster corpus, but a new
one built from my personal email).
It's currently set up as a 4-way split, with 210 spam
and 2500 ham in each Set.
I ran timcv four times, with --spam=200 in each run and one of
--ham=200
--ham=400
--ham=1000
--ham=2000
Standard settings, aside from
spam_cutoff: 0.58
mine_received_headers: True
count_all_header_lines, as reported last week, makes the results
considerably worse, so it is left off.
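For reference, the run sizes implied by this setup can be worked out as below. This is only a sketch: the names are mine, it does not call timcv, and it assumes each of the 4 Sets is scored exactly once, so the totals are 4 times the per-Set counts.

```python
# Sizes implied by a 4-way split with --spam=200 and a varying --ham.
# All names here are illustrative; nothing in this block runs timcv.
N_SETS = 4
SPAM_PER_SET = 200
HAM_PER_SET = [200, 400, 1000, 2000]


def run_sizes(ham_per_set, spam_per_set=SPAM_PER_SET, n_sets=N_SETS):
    """Return (ham:spam ratio, total ham scored, total spam scored)."""
    ratio = ham_per_set / spam_per_set
    return ratio, ham_per_set * n_sets, spam_per_set * n_sets


for ham in HAM_PER_SET:
    ratio, total_ham, total_spam = run_sizes(ham)
    print(f"--ham={ham}: {ratio:g}:1 ham:spam, "
          f"{total_ham} ham, {total_spam} spam scored")
```

The totals should line up with the "items" counts in the <stat> lines of each result file.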
==> ant_1:1s.txt <==
-> <stat> Ham scores for all runs: 800 items; mean 24.58; sdev 10.16
-> <stat> Spam scores for all runs: 800 items; mean 80.11; sdev 7.99
-> best cutoff for all runs: 0.575
-> with weighted total 1*3 fp + 7 fn = 10
-> fp rate 0.375% fn rate 0.875%
total unique false pos 3
total unique false neg 8
average fp % 0.375
average fn % 1.0
==> ant_2:1s.txt <==
-> <stat> Ham scores for all runs: 1600 items; mean 23.18; sdev 9.17
-> <stat> Spam scores for all runs: 800 items; mean 79.29; sdev 8.58
-> best cutoff for all runs: 0.55
-> with weighted total 1*2 fp + 8 fn = 10
-> fp rate 0.125% fn rate 1%
total unique false pos 1
total unique false neg 15
average fp % 0.0625
average fn % 1.875
==> ant_5:1s.txt <==
-> <stat> Ham scores for all runs: 4000 items; mean 21.02; sdev 8.44
-> <stat> Spam scores for all runs: 800 items; mean 77.80; sdev 9.45
-> best cutoff for all runs: 0.525
-> with weighted total 1*8 fp + 10 fn = 18
-> fp rate 0.2% fn rate 1.25%
total unique false pos 2
total unique false neg 23
average fp % 0.05
average fn % 2.875
==> ant_10:1s.txt <==
-> <stat> Ham scores for all runs: 8000 items; mean 19.46; sdev 7.89
-> <stat> Spam scores for all runs: 800 items; mean 76.09; sdev 10.45
-> best cutoff for all runs: 0.5
-> with weighted total 1*8 fp + 11 fn = 19
-> fp rate 0.1% fn rate 1.38%
total unique false pos 2
total unique false neg 53
average fp % 0.025
average fn % 6.625
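The "average fp %" and "average fn %" lines can be checked by hand from the unique counts. A small sketch, assuming (my reading of the numbers, not something the tools state here) that each percentage is the unique count divided by the total ham or spam scored:

```python
# Counts copied from the four result blocks above:
# (ham:spam label, ham items, spam items, unique fp, unique fn)
RUNS = [
    ("1:1", 800, 800, 3, 8),
    ("2:1", 1600, 800, 1, 15),
    ("5:1", 4000, 800, 2, 23),
    ("10:1", 8000, 800, 2, 53),
]


def pct(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total


for label, n_ham, n_spam, fp, fn in RUNS:
    print(f"{label:>5}: fp {pct(fp, n_ham):.4f}%  fn {pct(fn, n_spam):.3f}%")
```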
So increasing the proportion of ham relative to spam drags down the ham
mean and std dev, and also drags down the spam mean, but pushes the spam
std dev up (7.99 to 10.45).
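The direction of each statistic can be read straight off the <stat> lines. A quick sketch that does so, with the figures copied from the four result blocks and helper names that are my own:

```python
# (ham:spam label, ham mean, ham sdev, spam mean, spam sdev),
# copied from the <stat> lines of the four result files.
STATS = [
    ("1:1", 24.58, 10.16, 80.11, 7.99),
    ("2:1", 23.18, 9.17, 79.29, 8.58),
    ("5:1", 21.02, 8.44, 77.80, 9.45),
    ("10:1", 19.46, 7.89, 76.09, 10.45),
]
NAMES = ["ham mean", "ham sdev", "spam mean", "spam sdev"]


def trend(values):
    """Classify a sequence as strictly falling, strictly rising, or mixed."""
    pairs = list(zip(values, values[1:]))
    if all(b < a for a, b in pairs):
        return "falls"
    if all(b > a for a, b in pairs):
        return "rises"
    return "mixed"


# Column i + 1 of each row holds the statistic called NAMES[i].
trends = {
    name: trend([row[i + 1] for row in STATS])
    for i, name in enumerate(NAMES)
}

for name, direction in trends.items():
    print(f"{name}: {direction} as the ham:spam ratio goes from 1:1 to 10:1")
```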
Anthony