[Spambayes] SpamBayes old and new

Tim Peters tim.one at comcast.net
Thu Jan 8 16:29:34 EST 2004


[followups to spambayes-dev at python.org please, since discussion
 beyond this point gets increasingly technical]


[Simone Piunno]
>> Just out of curiosity, I've read this essay by Greg Louis:
>>
>>    http://www.bgl.nu/bogofilter/bayes.html
>>
>> I find it has some interesting thoughts on the balance problem.
>> Did you know about this essay?  Have you ever tried its approach?

[Tim Peters]
> ...
> Alex here did a relevant experiment, but his report lacks some needed
> detail:

http://mail.python.org/pipermail/spambayes-dev/2003-November/001592.html


I ran a test on my own recent email mix, using the current Outlook
add-in defaults.  "base" is the current code.  "bycount" changes one
line in classifier.py from

        prob = spamratio / (hamratio + spamratio)

to

        prob = float(spamcount) / (spamcount + hamcount)
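
For readers without the source at hand, here's a minimal standalone
sketch of the two variants.  The names follow classifier.py, but this
is a simplification; the real probability() method applies further
adjustments to the guess that are omitted here:

    def spamprob_by_ratio(hamcount, spamcount, nham, nspam):
        # "base":  normalize each count by its corpus size before
        # combining, so a 4::1 ham::spam training imbalance doesn't
        # skew the guess.  hamcount/spamcount are the number of
        # training hams/spams containing the token; nham/nspam are
        # the training corpus sizes.
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        return spamratio / (hamratio + spamratio)

    def spamprob_by_count(hamcount, spamcount):
        # "bycount":  use the raw counts, so whichever corpus is
        # larger systematically pulls every token's guess toward
        # its end of the scale.
        return float(spamcount) / (spamcount + hamcount)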

Results are certainly ... remarkable.  Since my incoming email has
lately been unbalanced at about a 4::1 ham::spam ratio, this is a more
interesting test than Greg's nearly-balanced one:

(This was 10-fold cross-validation: each "tested" line pits one fold
of 528 hams & 130 spams against a classifier trained on the other nine
folds.)

base -> bycount
-> <stat> tested 528 hams & 130 spams against 4752 hams & 1170 spams
<19 repetitions deleted>

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.189  0.000  won   -100.00%
    0.189  0.000  won   -100.00%
    0.379  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.000  tied

won   3 times
tied  7 times
lost  0 times

total unique fp went from 4 to 0 won   -100.00%
mean fp % went from 0.0757575757576 to 0.0 won   -100.00%

false negative percentages
    0.769  16.154  lost  +2000.65%
    0.769  23.077  lost  +2900.91%
    0.000  19.231  lost  +(was 0)
    0.769  23.077  lost  +2900.91%
    0.769  23.846  lost  +3000.91%
    0.769  16.154  lost  +2000.65%
    1.538  26.923  lost  +1650.52%
    0.000  12.308  lost  +(was 0)
    1.538  20.000  lost  +1200.39%
    1.538  17.692  lost  +1050.33%

won   0 times
tied  0 times
lost 10 times

total unique fn went from 11 to 258 lost  +2245.45%
mean fn % went from 0.846153846153 to 19.8461538462 lost  +2245.45%

ham mean                     ham sdev
   0.38    0.00 -100.00%        3.57    0.00 -100.00%
   0.34    0.00 -100.00%        3.70    0.09  -97.57%
   0.07    0.00 -100.00%        0.85    0.00 -100.00%
   0.03    0.00 -100.00%        0.43    0.00 -100.00%
   0.34    0.00 -100.00%        4.08    0.01  -99.75%
   0.26    0.00 -100.00%        4.36    0.00 -100.00%
   0.28    0.00 -100.00%        4.32    0.00 -100.00%
   0.55    0.00 -100.00%        6.44    0.00 -100.00%
   0.28    0.00 -100.00%        3.40    0.00 -100.00%
   0.29    0.00 -100.00%        3.24    0.00 -100.00%

ham mean and sdev for all runs
   0.28    0.00 -100.00%        3.81    0.03  -99.21%

spam mean                    spam sdev
  96.12   63.99  -33.43%       14.01   32.86 +134.55%
  97.15   58.20  -40.09%       12.56   35.04 +178.98%
  97.58   58.34  -40.21%        8.75   34.93 +299.20%
  97.72   58.61  -40.02%       10.38   36.75 +254.05%
  97.07   57.33  -40.94%       11.68   35.77 +206.25%
  97.00   61.26  -36.85%       13.01   33.07 +154.19%
  95.36   55.46  -41.84%       15.45   37.77 +144.47%
  97.54   67.03  -31.28%       10.86   31.88 +193.55%
  96.34   60.80  -36.89%       14.94   34.05 +127.91%
  95.81   60.84  -36.50%       14.94   33.66 +125.30%

spam mean and sdev for all runs
  96.77   60.19  -37.80%       12.86   34.77 +170.37%

ham/spam mean difference: 96.49 60.19 -36.30

filename:         base     bycount
ham:spam:    5280:1300   5280:1300
fp total:            4           0
fp %:             0.08        0.00
fn total:           11         258
fn %:             0.85       19.85
unsure t:          101         660
unsure %:         1.53       10.03
real cost:      $71.20     $390.00
best cost:      $53.00     $147.60
h mean:           0.28        0.00
h sdev:           3.81        0.03
s mean:          96.77       60.19
s sdev:          12.86       34.77
mean diff:       96.49       60.19
k:                5.79        1.73
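
(For reference on the cost rows: "real cost" appears to use the test
harness's default error weights of $10 per false positive, $1 per
false negative, and $0.20 per unsure, which reproduces the figures
above:

    base:     4*$10 +  11*$1 + 101*$0.20 =  $71.20
    bycount:  0*$10 + 258*$1 + 660*$0.20 = $390.00

"best cost" reports the same measure at the best cutoffs the harness
could find in hindsight.)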

Overall, since I have a lot more ham than spam now, computing the
initial spamprob guess from raw counts instead of from corpus-relative
ratios makes everything look hammier; if I had a lot more spam than ham
instead, everything would look spammier (a worked example appears
below).  Because everything looks hammier, the ham and spam means both
plummet, the spam variance skyrockets, there are fewer false positives,
almost-astonishingly more false negatives, and about half the spam
scores as unsure:

Ham: 5280 (100.00%) ok, 0 (0.00%) unsure, 0 (0.00%) fp
Spam: 382 (29.38%) ok, 660 (50.77%) unsure, 258 (19.85%) fn

Every ham was classed as ham (no fps, no unsures), but at the expense
of only 30% of the spam getting classed as spam, and 20% of it getting
classed as ham.
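
To make the "hammier" shift concrete, take a hypothetical token seen in
20 of the 4752 training hams and 20 of the 1170 training spams, i.e.
about four times more common per spam than per ham:

    hamcount, spamcount = 20, 20
    nham, nspam = 4752, 1170

    hamratio = hamcount / float(nham)        # ~0.0042
    spamratio = spamcount / float(nspam)     # ~0.0171

    # ratio-based ("base") guess:  ~0.80, a solidly spammy clue
    print spamratio / (hamratio + spamratio)

    # raw-count ("bycount") guess:  0.50, dead neutral
    print float(spamcount) / (spamcount + hamcount)

With 4x more ham trained, every token's guess shifts toward ham like
this, which is why the spam mean fell from 96.77 to 60.19 while the
ham mean dropped to 0.00.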

All in all, this experiment agreed with what Alex reported earlier:

> basing the prob on the raw counts instead of the ratios is
> an incredibly clearcut loss.  Only won twice on the false positives
> (by relatively small margins), but lost EVERY time on the false
> negatives by large amounts.

I should note that this test was run against *all* the email I've
received recently, so the ham::spam ratio used in the test is the same
one I see in real life.



