[Spambayes] training on very small ham sets, normal sized spamsets.

T. Alexander Popiel popiel@wolfskeep.com
Tue Oct 29 18:41:24 2002


In message:  <200210290125.g9T1Ppw09085@localhost.localdomain>
             Anthony Baxter <anthony@interlink.com.au> writes:

>So I hacked on timcv.py and msgs.py to add options 'spam-test', 
>'spam-train', 'ham-test' and 'ham-train', to allow you to set 
>the training set size separately to the testing set size.
>I haven't checked this in because it will break everyone's 
>test scripts - --spam= will no longer be distinct, and getopt
>will gripe. Let me know if I should check this in anyway - I 
>think it's useful, but YMMV.

I'd like to have it. :-)
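
For anyone wondering why the new options would break existing scripts:
Python's getopt accepts any unique prefix of a long option, so an
abbreviated --spam= only works while a single option starts with
"spam".  A minimal sketch of the effect follows.  The name 'spam-keep='
is an assumed stand-in for whatever long option --spam= abbreviates
today; 'spam-test=' and 'spam-train=' are the names from Anthony's
message.

```python
import getopt

# 'spam-keep=' is an ASSUMED stand-in for the existing long option.
# 'spam-test=' and 'spam-train=' are the proposed new options.
OLD_LONGOPTS = ["spam-keep="]
NEW_LONGOPTS = ["spam-keep=", "spam-test=", "spam-train="]


def parse(argv, longopts):
    """Return the parsed (option, value) pairs, or the getopt error text."""
    try:
        opts, _args = getopt.getopt(argv, "", longopts)
        return opts
    except getopt.GetoptError as err:
        return str(err)


# With one "spam..." option, the abbreviation expands to the full name.
before = parse(["--spam=200"], OLD_LONGOPTS)
# With three "spam..." options, the same abbreviation is ambiguous.
after = parse(["--spam=200"], NEW_LONGOPTS)

print(before)
print(after)
```

The first call expands the abbreviation to the full option name; the
second returns getopt's error text instead of a parsed option.  Spelling
the option out in full in the test scripts sidesteps the problem.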

>The numbers for each (001:, 002:, 003:, 005:, 010:, 015:, 020:) are 
>actually averages of 4 different runs for each, with different 
>-s options on each one (same set of 4 -s used for each, tho). 
>Otherwise the variation was just too damn high. It's still a little 
>'bloopy' - the unsure bounces around a bit, but it's not bad.

Cool.  Good to see someone more thorough than I am... I've
been getting(?) sloppy.  I'm not a real statistician, and
it shows.
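
The averaging step described above could be sketched roughly like this.
The function name, the dict layout, and the sample values are all
hypothetical; the values are placeholders, not numbers from these runs.

```python
from statistics import mean


def average_runs(runs):
    """Average each metric across runs.

    `runs` is a list of dicts, one per seed, all with the same keys.
    (Hypothetical layout, chosen only for illustration.)
    """
    keys = runs[0].keys()
    return {key: mean(run[key] for run in runs) for key in keys}


# Placeholder values for two seeds; NOT taken from the table.
sample_runs = [
    {"fp": 1.0, "unsure": 4.0},
    {"fp": 3.0, "unsure": 2.0},
]
averaged = average_runs(sample_runs)
print(averaged)
```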

>Here's the summary-summary table:
>ham-train  bestcost  realcost    fp%   fn% unsure%
>        1    430.80  11498.75  56.70  0.00   26.46
>       10    274.05   3345.10  15.76  0.03   32.06
>       20    245.50   1855.80   8.61  0.03   22.18
>       30    242.15   1642.90   7.64  0.00   19.23
>       40    234.40   1154.45   5.31  0.00   15.33
>       60    225.55    725.65   3.35  0.03    9.23
>      100    221.05    532.40   2.46  0.03    6.61
>      150    218.60    410.30   1.91  0.08    4.51
>      200    179.90    199.45   0.88  0.10    3.91
>      250    130.05    138.05   0.58  0.08    3.72
>      300     96.80    104.25   0.41  0.15    3.38
>      350     66.75     73.45   0.26  0.17    3.20
>      400     63.25     69.65   0.25  0.20    2.94
>      450     61.95     61.95   0.21  0.28    2.78
>      500     52.50     58.05   0.20  0.23    2.63
>      600     44.15     50.00   0.16  0.23    2.54
>      700     37.75     41.60   0.12  0.28    2.31
>     1000     26.20     27.80   0.06  0.28    2.09
>     1500     19.60     24.40   0.03  0.45    2.48
>     2000     15.50     20.70   0.00  0.45    2.70
>     2500     15.60     21.90   0.00  0.43    2.94
>     2700     20.60     22.80   0.00  0.50    2.97
>
>It seems like most of the wins come once you get up around 350, the
>number of spam trained on. The unsure bucket actually gets a bit worse
>as more ham is added - looking at the histograms, various bits of spam
>are dragged downwards.

Beautiful.  It looks like the excess ham only starts hurting
unsures after about 1000 ham (or about 3:1 ham:spam, given the
roughly 350 spam trained on).
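
A quick check of that reading against the quoted table, as a rough
sketch: the (ham-train, unsure%) pairs are copied from the table, and
350 is the spam training size Anthony mentions.  The variable names are
mine.

```python
# (ham-train, unsure%) pairs copied from the quoted summary table.
UNSURE_BY_HAM = [
    (1, 26.46), (10, 32.06), (20, 22.18), (30, 19.23), (40, 15.33),
    (60, 9.23), (100, 6.61), (150, 4.51), (200, 3.91), (250, 3.72),
    (300, 3.38), (350, 3.20), (400, 2.94), (450, 2.78), (500, 2.63),
    (600, 2.54), (700, 2.31), (1000, 2.09), (1500, 2.48), (2000, 2.70),
    (2500, 2.94), (2700, 2.97),
]
SPAM_TRAINED = 350  # "around 350, the number of spam trained on"

# Training size with the lowest unsure rate, and its ham:spam ratio.
best_ham, best_unsure = min(UNSURE_BY_HAM, key=lambda row: row[1])
ratio = best_ham / SPAM_TRAINED

print(best_ham, best_unsure, round(ratio, 2))  # 1000 2.09 2.86
```

Unsure% bottoms out at the 1000-ham row and climbs on every row after
it, which is where the "about 3:1" comes from.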

- Alex