[Spambayes] First result from Gary Robinson's ideas
Neale Pickett
neale@woozle.org
18 Sep 2002 15:43:33 -0700
So then, Neale Pickett <neale@woozle.org> is all like:
> total unique fn went from 12 to 1000 lost +8233.33%
> mean fn % went from 1.2 to 100.0 lost +8233.33%
Please disregard those results. They say that every single message in
my spam corpus got tagged as ham with this change. Investigating, I
found that I had neglected to remove a line calculating prob after
inserting Tim's new code, so everything was getting a probability of
0.5. On the positive side, my FP rate went to 0! ;)
So here's another run with the *right* code change.
The first run, run1, is using Tim's original classifier code. The
second is using the following modification Tim proposed to implement
Gary's first suggestion:
      prob_product = inverse_prob_product = 1.0
+     P = Q = 1.0
+     num_clues = 0
      for distance, prob, word, record in nbest:
          if prob is None:    # it's one of the dummies nbest started with
              continue
          if record is not None:  # else wordinfo doesn't know about it
              record.killcount += 1
          if evidence:
              clues.append((word, prob))
-         prob_product *= prob
-         inverse_prob_product *= 1.0 - prob
+         num_clues += 1
+         P *= 1.0 - prob
+         Q *= prob
+
+     if num_clues:
+         P = 1.0 - P**(1./num_clues)
+         Q = 1.0 - Q**(1./num_clues)
+         prob = (P-Q)/(P+Q)   # in -1 .. 1
+         prob = 0.5 + prob/2  # shift to 0 .. 1
+     else:
+         prob = 0.5
-     prob = prob_product / (prob_product + inverse_prob_product)
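For anyone who wants to play with the combining step in isolation, here's a
minimal sketch of the two schemes as standalone functions (the names
graham_combine and robinson_combine are mine, not from the classifier; the
arithmetic follows the diff above):

```python
def graham_combine(probs):
    """Original combining: product of probs vs. product of complements."""
    P = Q = 1.0
    for p in probs:
        P *= p
        Q *= 1.0 - p
    return P / (P + Q)

def robinson_combine(probs):
    """Gary's combining: geometric means of complements and probs,
    mapped from -1..1 into 0..1."""
    n = len(probs)
    if not n:
        return 0.5              # no clues -> no opinion
    P = Q = 1.0
    for p in probs:
        P *= 1.0 - p
        Q *= p
    P = 1.0 - P ** (1.0 / n)    # nth root tames the extreme products
    Q = 1.0 - Q ** (1.0 / n)
    S = (P - Q) / (P + Q)       # in -1 .. 1
    return 0.5 + S / 2          # shift to 0 .. 1
```

The nth root is what produces the "shades of gray" below: a single strong
clue can no longer drag the whole product to 0 or 1, so mixed evidence lands
in the middle instead of at the extremes.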
Additionally, I changed the "spam cutoff" from 0.9 to 0.5. Comparing
the results before (run1) and after (run2), I get:
"""
run1s -> run2s
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
false positive percentages
2.000 2.000 tied
1.500 1.500 tied
2.000 2.000 tied
1.000 1.500 lost +50.00%
0.500 0.500 tied
won 0 times
tied 4 times
lost 1 times
total unique fp went from 14 to 15 lost +7.14%
mean fp % went from 1.4 to 1.5 lost +7.14%
false negative percentages
1.500 1.000 won -33.33%
1.000 1.000 tied
1.500 1.500 tied
1.500 1.500 tied
1.000 1.000 tied
won 1 times
tied 4 times
lost 0 times
total unique fn went from 13 to 12 won -7.69%
mean fn % went from 1.3 to 1.2 won -7.69%
"""
So false positives basically stayed the same; in the one case where they
got worse, it was only by one message, which I would imagine is within
the margin of error, but I Am Not A Statistician :).
But, as Tim said earlier, what's really interesting is the distribution
of scores across all runs. The first run, without Gary's modification,
gives me the following distributions:
"""
Ham distribution for all runs:
* = 17 items
0.00 984 **********************************************************
2.50 1 *
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 1 *
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 14 *
Spam distribution for all runs:
* = 17 items
0.00 11 *
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 1 *
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 1 *
85.00 0
87.50 0
90.00 0
92.50 2 *
95.00 3 *
97.50 982 **********************************************************
"""
Your typical Grahamian black-or-white picture, with little middle
ground. With Gary's idea, however, come many more shades of gray:
"""
Ham distribution for all runs:
* = 12 items
0.00 681 *********************************************************
2.50 62 ******
5.00 18 **
7.50 10 *
10.00 14 **
12.50 33 ***
15.00 40 ****
17.50 28 ***
20.00 14 **
22.50 22 **
25.00 5 *
27.50 11 *
30.00 13 **
32.50 10 *
35.00 6 *
37.50 5 *
40.00 3 *
42.50 5 *
45.00 5 *
47.50 0
50.00 1 *
52.50 4 *
55.00 1 *
57.50 1 *
60.00 2 *
62.50 1 *
65.00 0
67.50 1 *
70.00 2 *
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 2 *
Spam distribution for all runs:
* = 14 items
0.00 1 *
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 4 *
35.00 0
37.50 1 *
40.00 1 *
42.50 1 *
45.00 3 *
47.50 0
50.00 4 *
52.50 4 *
55.00 2 *
57.50 9 *
60.00 13 *
62.50 9 *
65.00 12 *
67.50 14 *
70.00 9 *
72.50 15 **
75.00 8 *
77.50 7 *
80.00 11 *
82.50 12 *
85.00 9 *
87.50 0
90.00 1 *
92.50 2 *
95.00 13 *
97.50 835 ************************************************************
"""
So from my perspective (and again, IANAS) it looks like the algorithm
has gained some humility and is admitting when it's not sure about
stuff. I can't say this change is a clear win for my minuscule data
set, but it *does* appear to make the probability more meaningful.
Almost like the difference between linear space and log space.
Neale