[Spambayes] rmspik results
Rob Hooft
rob@hooft.net
Sun, 06 Oct 2002 20:43:30 +0200
As promised, here are some of my results from the current version of
rmspik.py. For the record, this is the setup I described in a previous
message:
I am currently using 10 sets of 1600 ham, and 10 sets of 560 spam. I am
now using 1,2,3,4,5 to train and 6,7,8,9,10 for analysis, and a second
test takes 6,7,8,9,10 to train and 1,2,3,4,5 for analysis.
I ran both tests with clt1, clt2 and clt3, giving 6 pik files that I
analysed using rmspik.py. That produces so many numbers that I wrote a
quick script to reduce each run to a single "score", something that
weighs the work of filtering unsure messages, the occurrence of fp's
and the occurrence of fn's. The score is computed as:
    fprate = float(nfp) / nham
    fnrate = float(nfn) / nspam
    unsurerate = float(nunsure) / ntot
    score = fprate * fpfac + fnrate * fnfac + unsurerate * unsurefac
Here fpfac=3000.0, fnfac=300.0 and unsurefac=100.0 represent one
possible "private" mix of priorities (you could think of these as the
cost in Euros or Dollars of each kind of mistake). For a mailing list,
I would argue that fnfac/unsurefac should be about the number of
members of the list, and that fp's are not too bad if you can send a
nice message to the poster telling him what happened and how to get his
message posted anyway.
The score is the last number on each line describing a run.
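As a sanity check on the formula, here is a minimal sketch that recomputes
the score for the first row of the first table below. It assumes that
nunsure counts the Unsure and UnsNOK columns for both ham and spam, and
that ntot = nham + nspam; neither is spelled out above, but with these
assumptions the first row comes out at the printed 5.6. The function and
argument names are only for illustration.

```python
def run_score(ham, spam, fpfac=3000.0, fnfac=300.0, unsurefac=100.0):
    """ham and spam are (ok, unsure, unsnok, err) count tuples for one run."""
    ham_ok, ham_unsure, ham_unsnok, nfp = ham
    spam_ok, spam_unsure, spam_unsnok, nfn = spam
    nham = sum(ham)
    nspam = sum(spam)
    ntot = nham + nspam
    # Assumption: every message in an Unsure or UnsNOK column counts as unsure.
    nunsure = ham_unsure + ham_unsnok + spam_unsure + spam_unsnok
    fprate = float(nfp) / nham
    fnrate = float(nfn) / nspam
    unsurerate = float(nunsure) / ntot
    return fprate * fpfac + fnrate * fnfac + unsurerate * unsurefac


# clt1-12345 from the first table
first = run_score((7745, 228, 21, 6), (2683, 104, 13, 0))
print(round(first, 1))  # prints 5.6
```

Changing fpfac, fnfac and unsurefac in the call is enough to try a
different mix of priorities on the same counts.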
With surefactor=1000, pminhamsure=pminspamsure=0.01, usetail=False,
medianoffset=False
expt sets   ham-OK  Unsure  UnsNOK  ERR  spam-OK  Unsure  UnsNOK  ERR  score
clt1-12345    7745     228      21    6     2683     104      13    0    5.6
clt1-67890    7738     225      33    4     2680     108       8    4    5.4
clt2-12345    7814     155      26    5     2690     101       9    0    4.6
clt2-67890    7781     180      34    5     2714      75       8    3    4.9
clt3-12345    7751     211      32    6     2681     110       9    0    5.6
clt3-67890    7704     256      35    5     2699      91       7    3    5.8
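To work with rows like these in a script, here is a small sketch that
splits one row into named fields and checks that the counts add up to the
5 x 1600 ham and 5 x 560 spam in each test half. The field names and the
parse_row helper are made up for this example; they are not part of
rmspik.py.

```python
from collections import namedtuple

Row = namedtuple("Row", "expt sets ham_ok ham_unsure ham_unsnok ham_err "
                        "spam_ok spam_unsure spam_unsnok spam_err score")


def parse_row(line):
    """Split a line such as 'clt1-12345 7745 ... 5.6' into a Row."""
    fields = line.split()
    expt, sets = fields[0].split("-")
    counts = [int(x) for x in fields[1:9]]  # the eight count columns
    return Row(expt, sets, *counts, score=float(fields[9]))


row = parse_row("clt1-12345 7745 228 21 6 2683 104 13 0 5.6")
nham = row.ham_ok + row.ham_unsure + row.ham_unsnok + row.ham_err
nspam = row.spam_ok + row.spam_unsure + row.spam_unsnok + row.spam_err
print(row.expt, row.sets, nham, nspam)  # prints clt1 12345 8000 2800
```

Because the helper splits on whitespace, it does not care how the columns
are padded.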
With pminhamsure=0.005 and pminspamsure=0.02
expt sets   ham-OK  Unsure  UnsNOK  ERR  spam-OK  Unsure  UnsNOK  ERR  score
clt1-12345    7746     227      21    6     2671     116      13    0    5.7
clt1-67890    7738     225      34    3     2670     118       8    4    5.1
clt2-12345    7835     134      26    5     2673     118       9    0    4.5
clt2-67890    7822     139      34    5     2693      96       8    3    4.8
clt3-12345    7783     179      33    5     2665     126       9    0    5.1
clt3-67890    7752     208      37    3     2672     118       7    3    4.9
With surefactor=10000
expt sets   ham-OK  Unsure  UnsNOK  ERR  spam-OK  Unsure  UnsNOK  ERR  score
clt1-12345    7492     481      23    4     2618     169      13    0    7.9
clt1-67890    7481     482      34    3     2601     187       9    3    8.0
clt2-12345    7810     159      27    4     2644     147       9    0    4.7
clt2-67890    7792     169      35    4     2660     129       8    3    5.0
clt3-12345    7743     219      34    4     2640     151       9    0    5.3
clt3-67890    7717     243      37    3     2643     147       7    3    5.5
With surefactor=10
expt sets   ham-OK  Unsure  UnsNOK  ERR  spam-OK  Unsure  UnsNOK  ERR  score
clt1-12345    7922      51      19    8     2733      54      11    2    4.5
clt1-67890    7905      39      32    5     2749      39       7    5    3.5
clt2-12345    7864     105      25    6     2683     108       8    1    4.6
clt2-67890    7852     109      33    6     2701      88       7    4    4.9
clt3-12345    7814     148      33    5     2675     116       9    0    4.7
clt3-67890    7779     181      36    4     2679     111       6    4    5.0
With surefactor=100, usetail=True
expt sets   ham-OK  Unsure  UnsNOK  ERR  spam-OK  Unsure  UnsNOK  ERR  score
clt1-12345    7777     192      24    7     2705      83      12    0    5.5
clt1-67890    7791     171      34    4     2704      84       8    4    4.7
clt2-12345    7824     143      28    5     2695      96       9    0    4.4
clt2-67890    7668     277      47    8     2731      62       4    3    6.9
clt3-12345    7802     165      28    5     2692      99       9    0    4.7
clt3-67890    7636     309      48    7     2727      66       4    3    6.9
With surefactor=100, usetail=True, medianoffset=True
expt sets   ham-OK  Unsure  UnsNOK  ERR  spam-OK  Unsure  UnsNOK  ERR  score
clt1-12345    7728     241      24    7     2704      84      12    0    6.0
clt1-67890    7753     210      33    4     2695      93       9    3    5.0
clt2-12345    7813     154      28    5     2699      92       9    0    4.5
clt2-67890    7653     292      48    7     2733      60       4    3    6.7
clt3-12345    7803     164      28    5     2693      98       9    0    4.6
clt3-67890    7636     307      50    7     2728      65       4    3    6.9
With surefactor=100, usetail=False, medianoffset=True
expt sets   ham-OK  Unsure  UnsNOK  ERR  spam-OK  Unsure  UnsNOK  ERR  score
clt1-12345    7842     127      25    6     2672     116      12    0    4.8
clt1-67890    7832     131      34    3     2675     113       8    4    4.2
clt2-12345    7816     147      32    5     2675     116       9    0    4.7
clt2-67890    7786     174      36    4     2684     106       6    4    4.9
clt3-12345    7807     156      32    5     2673     118       9    0    4.8
clt3-67890    7774     187      35    5     2684     106       6    4    5.4
Conclusions so far:
- medianoffset=True helps
- usetail=False is better than True
- clt1 seems to do best, although the difference is not large.
- there are large differences between the 12345 and 67890 runs.
I'm sure that systematic variation of the parameters (e.g. using a
simplex optimization?) will give me even better scores.
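For what it is worth, a downhill simplex (Nelder-Mead) search needs very
little code. The sketch below is a plain-Python version run on a made-up
smooth objective; in real use the objective would have to rerun the
rmspik.py analysis for a given parameter set and return its score, which
is not shown here. The names nelder_mead and toy_objective, and the
numbers inside toy_objective, are hypothetical.

```python
def nelder_mead(f, start, step=0.5, tol=1e-6, max_iter=500):
    """Minimise f over a list of floats with a basic downhill simplex."""
    n = len(start)
    # Initial simplex: the start point plus one point stepped along each axis.
    simplex = [list(start)]
    for i in range(n):
        point = list(start)
        point[i] += step
        simplex.append(point)

    for _ in range(max_iter):
        simplex.sort(key=f)
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        # Stop once every vertex is within tol of the best one.
        size = max(abs(p[i] - best[i]) for p in simplex for i in range(n))
        if size < tol:
            break
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]

        def along(t):
            # Point on the line through the centroid and the worst vertex.
            return [c + t * (w - c) for c, w in zip(centroid, worst)]

        reflected = along(-1.0)
        if f(reflected) < f(best):
            expanded = along(-2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(second_worst):
            simplex[-1] = reflected
        else:
            contracted = along(0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway toward the best one.
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)


def toy_objective(params):
    # Hypothetical stand-in for "run the analysis and return the score".
    x, y = params
    return (x - 1.0) ** 2 + 2.0 * (y + 0.5) ** 2 + 4.0


print(nelder_mead(toy_objective, [0.0, 0.0]))
```

The simplex method works on continuous values, so True/False switches
like usetail and medianoffset would have to be varied outside such a
search, one search per combination.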
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/