[Spambayes] Getting rid of max_spamprob and min_spamprob

Neil Schemenauer nas@python.ca
Sun, 15 Sep 2002 16:13:03 -0700


I don't like the max_spamprob and min_spamprob limits.  I've written a
version of spamprob() that uses long integers, does not clamp the
probabilities and uses all evidence.

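    # Assumes "from sets import Set" (or an equivalent Set class) is in scope.
    def spamprob(self, wordstream, evidence=False):
        wordinfoget = self.wordinfo.get
        # numerator/denominator accumulates the ham:spam odds as exact
        # long integers, so nothing underflows and no clamping is needed.
        numerator = denominator = 1L
        nham = self.nham
        nspam = self.nspam
        for word in Set(wordstream):
            record = wordinfoget(word)
            if record is None:
                # Word never seen in training: contributes no evidence.
                continue
            hamcount = record.hamcount
            spamcount = record.spamcount
            if hamcount == 0:
                # Spam-only word: treat it as seen in one extra ham.
                numerator *= nspam
                denominator *= (nham + 1) * spamcount
            elif spamcount == 0:
                # Ham-only word: treat it as seen in one extra spam.
                numerator *= (nspam + 1) * hamcount
                denominator *= nham
            else:
                numerator *= nspam * hamcount
                denominator *= nham * spamcount
        # Note: frac is the integer remainder numerator % denominator,
        # not a fractional part.
        real, frac = divmod(numerator, denominator)
        huge = 1L<<30
        if real > 0:
            # Ham odds >= 1: spam probability is about 1 / (1 + odds).
            if real > huge:
                prob = 0.0
            else:
                prob = 1.0 / (real + 1.0)
        else:
            # Ham odds < 1: the message leans towards spam.
            if frac > huge:
                prob = 1.0
            else:
                prob = frac / (1.0 + frac)
        if evidence:
            return (prob, [])
        else:
            return prob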
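
One detail in the final conversion is worth a second look.
divmod(numerator, denominator) returns the integer remainder, not a
fractional part, so whenever the ham odds are below 1, frac is just
numerator itself.  After a few multiplications that is far larger than
huge, and prob comes out as 1.0 no matter how close the odds are to
even.  Below is a minimal sketch of an exact conversion, written for
current Python (where plain ints are arbitrary precision) and assuming,
as the loop suggests, that numerator/denominator is the ham:spam odds.
The names original_tail and odds_to_spamprob are made up for
illustration and are not part of the classifier.

```python
HUGE = 1 << 30  # same threshold as `huge` in the posted function


def original_tail(numerator, denominator):
    """The final conversion as posted, transcribed to current Python."""
    real, frac = divmod(numerator, denominator)
    if real > 0:
        return 0.0 if real > HUGE else 1.0 / (real + 1.0)
    # Here real == 0, so frac == numerator (the remainder).
    return 1.0 if frac > HUGE else frac / (1.0 + frac)


def odds_to_spamprob(numerator, denominator):
    """Hypothetical exact version: P(spam) = 1 / (1 + ham odds).

    With ham odds r = numerator / denominator this equals
    denominator / (numerator + denominator).
    """
    # True division of two ints is done on the exact integers, so very
    # large operands are fine as long as the quotient fits in a float
    # (it is always between 0 and 1 here).
    return denominator / (numerator + denominator)


# Ham odds of 3:4, scaled by a large common factor as would happen
# after many multiplications.  The message is only mildly spammy.
scale = 10 ** 20
print(original_tail(3 * scale, 4 * scale))     # 1.0
print(odds_to_spamprob(3 * scale, 4 * scale))  # roughly 0.571 (4/7)
```

If this reading is right, part of the false positive count may come
from the conversion rather than from dropping the clamping, so the
numbers are worth re-running before drawing conclusions.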

The results are interesting, IMHO.  First the rate summary:

    total unique false pos 113
    total unique false neg 0
    average fp % 6.27777777778
    average fn % 0.0

The fp rate sucks but the fn rate is great.  Here are the histograms for
all runs:

Ham distribution for all runs:
* = 28 items
  0.00 1668 ************************************************************
  2.50    7 *
  5.00    0 
  7.50    3 *
 10.00    0 
 12.50    0 
 15.00    0 
 17.50    0 
 20.00    1 *
 22.50    0 
 25.00    3 *
 27.50    0 
 30.00    0 
 32.50    2 *
 35.00    0 
 37.50    0 
 40.00    0 
 42.50    0 
 45.00    0 
 47.50    0 
 50.00    3 *
 52.50    0 
 55.00    0 
 57.50    0 
 60.00    0 
 62.50    0 
 65.00    0 
 67.50    0 
 70.00    0 
 72.50    0 
 75.00    0 
 77.50    0 
 80.00    0 
 82.50    0 
 85.00    0 
 87.50    0 
 90.00    0 
 92.50    0 
 95.00    0 
 97.50  113 *****
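
The isolated spikes at 20.00, 25.00, 32.50 and 50.00 are consistent
with the integer division in spamprob(): when real > 0 the score is
1.0 / (real + 1.0) with real a whole number, so only 1/2, 1/3, 1/4,
1/5, ... can occur.  A small check, assuming the histogram uses 40
buckets of width 2.5 labelled by their lower edge (the helper name
bucket_label is made up for illustration):

```python
def bucket_label(real, nbuckets=40):
    """Lower edge, in percent, of the bucket holding 1 / (real + 1).

    Assumes real >= 1, as in the `real > 0` branch of spamprob().
    """
    # floor(nbuckets / (real + 1)), computed in exact integer arithmetic
    index = nbuckets // (real + 1)
    return index * 100.0 / nbuckets


for real in range(1, 5):
    print(real, bucket_label(real))
# 1 50.0
# 2 32.5
# 3 25.0
# 4 20.0
```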

Spam distribution for all runs:
* = 30 items
  0.00    0 
  2.50    0 
  5.00    0 
  7.50    0 
 10.00    0 
 12.50    0 
 15.00    0 
 17.50    0 
 20.00    0 
 22.50    0 
 25.00    0 
 27.50    0 
 30.00    0 
 32.50    0 
 35.00    0 
 37.50    0 
 40.00    0 
 42.50    0 
 45.00    0 
 47.50    0 
 50.00    0 
 52.50    0 
 55.00    0 
 57.50    0 
 60.00    0 
 62.50    0 
 65.00    0 
 67.50    0 
 70.00    0 
 72.50    0 
 75.00    0 
 77.50    0 
 80.00    0 
 82.50    0 
 85.00    0 
 87.50    0 
 90.00    0 
 92.50    0 
 95.00    0 
 97.50 1800 ************************************************************

Perhaps there is some way we can swap the two rates by introducing some
bias.
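
One form such a bias could take, purely as a sketch: scale the ham
odds by a constant factor before converting them to a probability.
The function name and the ham_bias parameter are hypothetical, the
values below are arbitrary, and nothing here has been run against the
corpus.

```python
def biased_spamprob(numerator, denominator, ham_bias=1):
    """Spam probability from ham:spam odds, with the ham odds scaled.

    ham_bias > 1 favours ham (fewer false positives at the cost of
    more false negatives); ham_bias == 1 leaves the odds unchanged.
    Keep ham_bias an integer so the big-integer arithmetic stays exact.
    """
    biased = numerator * ham_bias
    # P(spam) = 1 / (1 + ham odds) = denominator / (biased + denominator)
    return denominator / (biased + denominator)


print(biased_spamprob(1, 9))               # 0.9
print(biased_spamprob(1, 9, ham_bias=81))  # 0.1
```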

  Neil