[spambayes-dev] Mozilla SpamBayes "porting"

Wed Feb 18 17:21:39 EST 2004

Sorry for the semi-off-topic post, but any help would be much appreciated.

Mozilla's mail apps currently use Paul Graham's original algorithm with some basic tokenizing techniques.  This 
situation could use some improvement, so now there is an effort to copy some ideas from Spambayes into Mozilla.

I wrote a Mozilla patch that tries to port the chi2-combining technique from classifier.py into Mozilla's C++.  My 
testing shows a large improvement in the false-negative (fn) rate, but a clear deterioration in the false-positive (fp) 
rate.  For example, in a test with a 3,741-message corpus we got:
original   - fn:206 fp:0
chi2 patch - fn:63  fp:11

My question is, did you guys notice a similar increase in fp rates when you originally switched from Graham to chi2?  If 
not, then I'll assume that I made a mistake in porting classifier.py.

Many thanks,
Miguel

PS.
If anyone is interested in what Mozilla is doing, you can look here:
http://bugzilla.mozilla.org/show_bug.cgi?id=181534
http://bugzilla.mozilla.org/show_bug.cgi?id=230093
http://bugzilla.mozilla.org/show_bug.cgi?id=231873

Here is the core of my C++ port, if anyone wants to take a look.  Note that I included the 
"experimental_ham_spam_imbalance_adjustment" option.  Could that be my problem?

double spam2ham = dmin(nbad/ngood, 1.0);
double ham2spam = dmin(ngood/nbad, 1.0);

/** This section comes from probability(self, record) and _getclues(self, wordstream)**/
    for (i = 0; i < count; ++i) {
         // tokens is an array of Token; each Token carries
         // both mProbability and mDistance
         Token& token = tokens[i];

         const char* word = token.mWord;
         Token* t = mGoodTokens.get(word);
         double hamcount = ((t != NULL) ? t->mCount : 0);
         t = mBadTokens.get(word);
         double spamcount = ((t != NULL) ? t->mCount : 0);

         prob = (spamcount / nbad) / ( hamcount / ngood + spamcount / nbad);
         double n = hamcount * spam2ham + spamcount * ham2spam;
         prob =  (0.225 + n * prob) / (.45 + n);
         double distance = abs(prob - 0.5);
         if (distance >= .1) {
                 goodclues++;
                 token.mDistance = distance;
                 token.mProbability = prob;

         } else {
                 token.mDistance = -1; //ignore clue
         }
     }

     // sort the array by the token distances
     PRUint32 first, last = count;
     if (count > 150) {
         first = count - 150;

         // This call sorts the array by token.mDistance
         NS_QuickSort(tokens, count, sizeof(Token), compareTokens, NULL);
     } else {
         first = 0;
     }

/** This section comes from chi2_spamprob(self, wordstream, evidence=False) **/
     double H = 1.0, S = 1.0, Hexp = 0, Sexp = 0;
     goodclues=0;
     int e;
     for (i = first; i < last; ++i) {
         if (tokens[i].mDistance != -1) {
             goodclues++;
             double value = tokens[i].mProbability;
             S *= (1.0 - value);
             H *= value;
             if ( S < 1e-200 ) {
                     S = frexp(S, &e);
                     Sexp += M_E;
             }
             if ( H < 1e-200 ) {
                     H = frexp(H, &e);
                     Hexp +=M_E;
             }
         }
     }

     S = log(S) + Sexp * M_LN2;
     H = log(H) + Hexp * M_LN2;

     if (goodclues>0) {
         S = 1.0 - chi2Q(-2.0 * S, 2 * goodclues);
         H = 1.0 - chi2Q(-2.0 * H, 2 * goodclues);
         prob = (S-H +1.0) / 2.0;
     } else {
         prob = 0.5;
     }

     PRBool isJunk = (prob >= 0.90); //hardcoded at .9
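
For comparison, here is a minimal standalone sketch of the per-token step from the first loop above.  It is not part of the patch: tokenSpamProb and clueDistance are hypothetical names, and it assumes ngood and nbad are unsigned integer counts, both greater than zero.  If they are integers in the real code, nbad/ngood would truncate before dmin sees it, so the sketch converts to double first.  It also returns 0.5 for a token seen in neither corpus rather than dividing 0 by 0, and uses std::fabs so the distance cannot be truncated by an integer abs.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper, not part of the patch: spam probability of one token,
// using the same formulas as the loop in the post.
// hamcount/spamcount: how often the token was seen in ham/spam.
// ngood/nbad: number of ham/spam messages trained on (assumed > 0).
double tokenSpamProb(double hamcount, double spamcount,
                     unsigned ngood, unsigned nbad) {
    // Convert before dividing so integer counts are not truncated.
    const double good = static_cast<double>(ngood);
    const double bad = static_cast<double>(nbad);
    const double spam2ham = std::min(bad / good, 1.0);
    const double ham2spam = std::min(good / bad, 1.0);

    const double hamratio = hamcount / good;
    const double spamratio = spamcount / bad;
    if (hamratio + spamratio == 0.0)
        return 0.5;  // never-seen token: neutral value, avoids 0/0

    const double prob = spamratio / (hamratio + spamratio);
    const double n = hamcount * spam2ham + spamcount * ham2spam;
    return (0.225 + n * prob) / (0.45 + n);
}

// Strength of a clue: its distance from the neutral value 0.5.
double clueDistance(double prob) {
    return std::fabs(prob - 0.5);
}
```

With integer division, a corpus with more ham than spam would give spam2ham == 0, so a ham-only token would get n == 0 and a probability of exactly 0.5, and would then be dropped as a weak clue.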

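The underflow guard in the second loop may also be worth a second look.  frexp splits a value into a mantissa and a power-of-two exponent, and the later "Sexp * M_LN2" term only makes sense if Sexp is the sum of those exponents.  The loop as posted adds the constant M_E instead of the exponent e that frexp returned.  Below is a minimal sketch of the intended bookkeeping, assuming every factor lies in (0, 1]; logProduct is a hypothetical name, not something in the patch.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: natural log of the product of `factors`
// (each assumed to lie in (0, 1]), computed without underflow.
double logProduct(const std::vector<double>& factors) {
    double mantissa = 1.0;
    long exponent = 0;  // accumulated power of two
    for (double f : factors) {
        mantissa *= f;
        if (mantissa < 1e-200) {
            int e = 0;
            // After this call: new mantissa * 2^e == old mantissa.
            mantissa = std::frexp(mantissa, &e);
            exponent += e;  // accumulate the exponent frexp returned
        }
    }
    const double ln2 = 0.6931471805599453;  // natural log of 2
    return std::log(mantissa) + exponent * ln2;
}
```

The guard only fires on messages with many strong clues, so a mistake there would distort exactly the scores that ought to be the most confident.
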
------------------------------------
Here's the chi2Q function:
static double chi2Q (double x2, double v) {
         PRUint32 i;
         double m = x2 / 2.0;
         double sum = exp(-m);
         double term = exp(-m);

         for (i=1;i<=floor(v/2);i++) {
                 term *= m / i;
                 sum += term;
         }
         return dmin(sum,1.0);
}
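
One more thing that may be worth checking is the loop bound.  For an even number of degrees of freedom v, the chi-squared tail probability has the closed form exp(-m) * (1 + m + m^2/2! + ... + m^(v/2-1)/(v/2-1)!), with m = x2/2.  That is v/2 terms counting the leading exp(-m), so the loop would run for i from 1 to v/2 - 1.  The version above runs to i <= v/2 and so appears to add one extra term.  Below is a minimal sketch of the bounded version, assuming v is always even (it is called with 2 * goodclues); chi2QEven is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical variant of chi2Q: tail probability of the chi-squared
// distribution for an even number of degrees of freedom v.
double chi2QEven(double x2, int v) {
    const double m = x2 / 2.0;
    double term = std::exp(-m);
    double sum = term;
    for (int i = 1; i < v / 2; ++i) {  // v/2 terms in total, counting exp(-m)
        term *= m / i;
        sum += term;
    }
    return std::min(sum, 1.0);
}
```

A quick sanity check is v = 2, where the result should be exactly exp(-x2/2); with the extra term it would come out as exp(-m) * (1 + m).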