[Spambayes] rmspik results

Rob Hooft rob@hooft.net
Sun, 06 Oct 2002 20:43:30 +0200


As promised, here are some results from the current version of
rmspik.py. For the record, this is what I wrote in a previous message:

I am currently using 10 sets of 1600 ham, and 10 sets of 560 spam. I am 
now using 1,2,3,4,5 to train and 6,7,8,9,10 for analysis, and a second
test takes 6,7,8,9,10 to train and 1,2,3,4,5 for analysis.
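
The two-way split can be sketched as follows. This is only an illustration of the bookkeeping, not the actual test driver: the helper name two_way_runs is made up, and no files are read. It also shows how many messages each run scores, given the set sizes quoted above.

```python
# Sketch of the two-way split: set numbers only, no file access.
# HAM_PER_SET and SPAM_PER_SET come from the message; the helper name
# two_way_runs is hypothetical.
HAM_PER_SET = 1600
SPAM_PER_SET = 560

first_half = [1, 2, 3, 4, 5]
second_half = [6, 7, 8, 9, 10]


def two_way_runs(a, b):
    """Return (train_sets, test_sets) pairs: train on a and test on b, then swapped."""
    return [(a, b), (b, a)]


for train, test in two_way_runs(first_half, second_half):
    n_ham = HAM_PER_SET * len(test)
    n_spam = SPAM_PER_SET * len(test)
    print("train on", train, "analyse", test, "->", n_ham, "ham,", n_spam, "spam")
```

Each run therefore analyses 8000 ham and 2800 spam.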

I ran these two tests with clt1, clt2 and clt3, resulting in 6 pik
files that I analysed using rmspik.py. This produces such a mass of
results that I wrote a quick script to boil each run down to a "score",
something that weighs the work of filtering unsure messages, the
occurrence of fp's and the occurrence of fn's. The score is computed as:

         fprate = float(nfp) / nham
         fnrate = float(nfn) / nspam
         unsurerate = float(nunsure) / ntot
         score = fprate*fpfac + fnrate*fnfac + unsurerate*unsurefac

where fpfac=3000.0, fnfac=300.0 and unsurefac=100.0 represent one
possible "private" mix of priorities (you could think of these as the
cost in Euros or Dollars of each kind of mistake). For a mailing list,
one line of reasoning says fnfac/unsurefac should be about the number
of members of the list, and fp's are not too bad if you can send the
poster a nice message explaining what happened and how to get the
message posted anyway.

The score is the last number on each line describing a run; since it is
a weighted cost, lower is better.
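
A minimal sketch of how the score can be recomputed from one row of the tables below. The function name run_score and the tuple layout are made up for illustration. Several things are assumptions not spelled out above: nfp and nfn are taken from the ham and spam ERR columns, ntot is nham+nspam, and nunsure counts both the Unsure and UnsNOK columns for ham and spam. Under those assumptions the sketch gives 5.6 and 5.4 for the first two rows of the first table.

```python
# Weights quoted in the message.
FPFAC = 3000.0
FNFAC = 300.0
UNSUREFAC = 100.0


def run_score(ham, spam, fpfac=FPFAC, fnfac=FNFAC, unsurefac=UNSUREFAC):
    """Score one run; ham and spam are (ok, unsure, unsure_nok, err) tuples.

    The mapping from table columns to nfp, nfn, nunsure and ntot is an
    assumption made for this sketch, not taken from the original script.
    """
    nham = sum(ham)
    nspam = sum(spam)
    ntot = nham + nspam                            # assumption
    nfp = ham[3]                                   # ham ERR column
    nfn = spam[3]                                  # spam ERR column
    nunsure = ham[1] + ham[2] + spam[1] + spam[2]  # assumption: Unsure + UnsNOK

    fprate = float(nfp) / nham
    fnrate = float(nfn) / nspam
    unsurerate = float(nunsure) / ntot
    return fprate * fpfac + fnrate * fnfac + unsurerate * unsurefac


# Rows clt1-12345 and clt1-67890 of the first table.
print(round(run_score((7745, 228, 21, 6), (2683, 104, 13, 0)), 1))  # 5.6
print(round(run_score((7738, 225, 33, 4), (2680, 108, 8, 4)), 1))   # 5.4
```

Passing different fpfac, fnfac and unsurefac values is a quick way to see how another mix of priorities would rank the same runs.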

With surefactor=1000, pmin(sp|h)amsure=0.01, usetail=False, medianoffset=False

expt sets    ham-OK  Unsure UnsNOK  ERR spam-OK  Unsure UnsNOK  ERR score
clt1-12345   7745    228    21      6   2683     104    13      0  5.6
clt1-67890   7738    225    33      4   2680     108     8      4  5.4
clt2-12345   7814    155    26      5   2690     101     9      0  4.6
clt2-67890   7781    180    34      5   2714      75     8      3  4.9
clt3-12345   7751    211    32      6   2681     110     9      0  5.6
clt3-67890   7704    256    35      5   2699      91     7      3  5.8

With pminhamsure=0.005 and pminspamsure=0.02

expt sets    ham-OK  Unsure UnsNOK  ERR spam-OK  Unsure UnsNOK  ERR score
clt1-12345   7746    227    21      6   2671     116    13      0  5.7
clt1-67890   7738    225    34      3   2670     118     8      4  5.1
clt2-12345   7835    134    26      5   2673     118     9      0  4.5
clt2-67890   7822    139    34      5   2693      96     8      3  4.8
clt3-12345   7783    179    33      5   2665     126     9      0  5.1
clt3-67890   7752    208    37      3   2672     118     7      3  4.9

With surefactor=10000

expt sets    ham-OK  Unsure UnsNOK  ERR spam-OK  Unsure UnsNOK  ERR score
clt1-12345   7492    481    23      4   2618     169    13      0  7.9
clt1-67890   7481    482    34      3   2601     187     9      3  8.0
clt2-12345   7810    159    27      4   2644     147     9      0  4.7
clt2-67890   7792    169    35      4   2660     129     8      3  5.0
clt3-12345   7743    219    34      4   2640     151     9      0  5.3
clt3-67890   7717    243    37      3   2643     147     7      3  5.5

With surefactor=10

expt sets    ham-OK  Unsure UnsNOK  ERR spam-OK  Unsure UnsNOK  ERR score
clt1-12345   7922     51    19      8   2733      54    11      2  4.5
clt1-67890   7905     39    32      5   2749      39     7      5  3.5
clt2-12345   7864    105    25      6   2683     108     8      1  4.6
clt2-67890   7852    109    33      6   2701      88     7      4  4.9
clt3-12345   7814    148    33      5   2675     116     9      0  4.7
clt3-67890   7779    181    36      4   2679     111     6      4  5.0

With surefactor=100, usetail=True

expt sets    ham-OK  Unsure UnsNOK  ERR spam-OK  Unsure UnsNOK  ERR score
clt1-12345   7777    192    24      7   2705      83    12      0  5.5
clt1-67890   7791    171    34      4   2704      84     8      4  4.7
clt2-12345   7824    143    28      5   2695      96     9      0  4.4
clt2-67890   7668    277    47      8   2731      62     4      3  6.9
clt3-12345   7802    165    28      5   2692      99     9      0  4.7
clt3-67890   7636    309    48      7   2727      66     4      3  6.9

With surefactor=100, usetail=True, medianoffset=True

expt sets    ham-OK  Unsure UnsNOK  ERR spam-OK  Unsure UnsNOK  ERR score
clt1-12345   7728    241    24      7   2704      84    12      0  6.0
clt1-67890   7753    210    33      4   2695      93     9      3  5.0
clt2-12345   7813    154    28      5   2699      92     9      0  4.5
clt2-67890   7653    292    48      7   2733      60     4      3  6.7
clt3-12345   7803    164    28      5   2693      98     9      0  4.6
clt3-67890   7636    307    50      7   2728      65     4      3  6.9

With surefactor=100, usetail=False, medianoffset=True

expt sets    ham-OK  Unsure UnsNOK  ERR spam-OK  Unsure UnsNOK  ERR score
clt1-12345   7842    127    25      6   2672     116    12      0  4.8
clt1-67890   7832    131    34      3   2675     113     8      4  4.2
clt2-12345   7816    147    32      5   2675     116     9      0  4.7
clt2-67890   7786    174    36      4   2684     106     6      4  4.9
clt3-12345   7807    156    32      5   2673     118     9      0  4.8
clt3-67890   7774    187    35      5   2684     106     6      4  5.4

Conclusions so far:
  - medianoffset=True helps
  - usetail=False is better than True
  - clt1 seems to do best, although the difference is not large.
  - there are large differences between the 12345 and 67890 runs.

I'm sure that systematic variation of the parameters (e.g. using a 
simplex optimization?) will give me even better scores.

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/