[Spambayes] training on very small ham sets, normal sized spamsets.
T. Alexander Popiel
popiel@wolfskeep.com
Tue Oct 29 18:41:24 2002
In message: <200210290125.g9T1Ppw09085@localhost.localdomain>
Anthony Baxter <anthony@interlink.com.au> writes:
>So I hacked on timcv.py and msgs.py to add options 'spam-test',
>'spam-train', 'ham-test' and 'ham-train', to allow you to set
>the training set size separately to the testing set size.
>I haven't checked this in because it will break everyone's
>test scripts - --spam= will no longer be distinct, and getopt
>will gripe. Let me know if I should check this in anyway - I
>think it's useful, but YMMV.
I'd like to have it. :-)
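
For anyone wondering why existing scripts would break: Python's getopt accepts any unique prefix of a long option, and rejects a prefix that matches more than one. A minimal sketch of that behaviour follows. The option list is an assumption built from the four names in the quoted message, not the actual list in timcv.py, and try_parse is a hypothetical helper.

```python
import getopt

# Assumed long-option list after the change: the four names from the
# quoted message, each taking a value. Not copied from timcv.py.
NEW_LONGOPTS = ['spam-train=', 'spam-test=', 'ham-train=', 'ham-test=']


def try_parse(argv, longopts):
    """Return the parsed options, or the GetoptError if parsing fails."""
    try:
        opts, _args = getopt.getopt(argv, '', longopts)
        return opts
    except getopt.GetoptError as err:
        return err


# An old-style script passing --spam= now hits an ambiguous prefix.
old_style = try_parse(['--spam=200'], NEW_LONGOPTS)

# Spelling out the new names parses cleanly.
new_style = try_parse(['--spam-train=200', '--spam-test=200'], NEW_LONGOPTS)

print(old_style)
print(new_style)
```

With this option list, `--spam=` matches both spam-train and spam-test, so getopt refuses it rather than guessing.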
>The numbers for each (001:, 002:, 003:, 005:, 010:, 015:, 020:) are
>actually averages of 4 different runs for each, with different
>-s options on each one (same set of 4 -s used for each, tho).
>Otherwise the variation was just too damn high. It's still a little
>'bloopy' - the unsure bounces around a bit, but it's not bad.
Cool. Good to see someone more thorough than I am... I've
been getting(?) sloppy. I'm not a real statistician, and
it shows.
>Here's the summary-summary table:
>ham-train  bestcost  realcost    fp%   fn%  unsure%
>        1    430.80  11498.75  56.70  0.00    26.46
>       10    274.05   3345.10  15.76  0.03    32.06
>       20    245.50   1855.80   8.61  0.03    22.18
>       30    242.15   1642.90   7.64  0.00    19.23
>       40    234.40   1154.45   5.31  0.00    15.33
>       60    225.55    725.65   3.35  0.03     9.23
>      100    221.05    532.40   2.46  0.03     6.61
>      150    218.60    410.30   1.91  0.08     4.51
>      200    179.90    199.45   0.88  0.10     3.91
>      250    130.05    138.05   0.58  0.08     3.72
>      300     96.80    104.25   0.41  0.15     3.38
>      350     66.75     73.45   0.26  0.17     3.20
>      400     63.25     69.65   0.25  0.20     2.94
>      450     61.95     61.95   0.21  0.28     2.78
>      500     52.50     58.05   0.20  0.23     2.63
>      600     44.15     50.00   0.16  0.23     2.54
>      700     37.75     41.60   0.12  0.28     2.31
>     1000     26.20     27.80   0.06  0.28     2.09
>     1500     19.60     24.40   0.03  0.45     2.48
>     2000     15.50     20.70   0.00  0.45     2.70
>     2500     15.60     21.90   0.00  0.43     2.94
>     2700     20.60     22.80   0.00  0.50     2.97
>
>It seems like most of the wins come once you get up around 350, the
>number of spam trained on. The unsure bucket actually gets a bit worse
>as more ham is added - looking at the histograms, various bits of spam
>are dragged downwards.
Beautiful. It looks like the excess ham only starts hurting
unsures after about 1000 ham (or about 3:1 ham:spam).
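
The 1000 and 3:1 figures can be checked directly against the quoted table. The sketch below copies only the ham-train and unsure% columns, and takes 350 as the spam training size from the quoted text; the variable names are made up for illustration.

```python
# Spam training-set size, as stated in the quoted message.
SPAM_TRAINED = 350

# (ham-train, unsure%) pairs copied from the summary-summary table.
ROWS = [
    (1, 26.46), (10, 32.06), (20, 22.18), (30, 19.23), (40, 15.33),
    (60, 9.23), (100, 6.61), (150, 4.51), (200, 3.91), (250, 3.72),
    (300, 3.38), (350, 3.20), (400, 2.94), (450, 2.78), (500, 2.63),
    (600, 2.54), (700, 2.31), (1000, 2.09), (1500, 2.48), (2000, 2.70),
    (2500, 2.94), (2700, 2.97),
]

# Row with the smallest unsure percentage.
best_ham, best_unsure = min(ROWS, key=lambda row: row[1])

# Ham:spam training ratio at that row.
ratio = best_ham / SPAM_TRAINED

# Rows with more ham than the best row, to see whether unsure% rises.
beyond = [(ham, unsure) for ham, unsure in ROWS if ham > best_ham]

print(f"lowest unsure%: {best_unsure} at ham-train={best_ham}")
print(f"ham:spam ratio there: {ratio:.2f}:1")
for ham, unsure in beyond:
    print(f"  ham-train={ham}: unsure% {unsure}")
```

The minimum falls at ham-train=1000, a ratio of roughly 2.86:1, and every row past it has a higher unsure%. These are averages of only four runs per row, so the exact turning point should be read loosely.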
- Alex