[Spambayes] ok, i'm confused

Fri Mar 7 17:19:25 EST 2003

    Tim> I removed that part, in order to make an internal inconsistency
    Tim> clearer: the overall score is

    Tim>             prob = (S-H + 1.0) / 2.0

    Tim> and 0.95 simply doesn't make any sense with H ~= 0.56 and S ~=
    Tim> 0.47.

Problem solved.  The message had already been run through spambayes once, so
it already had X-Spambayes-Classification and X-Spambayes-Debug headers.
The second time, when I ran it through hammiefilter manually, I forgot to set
BAYESCUSTOMIZE, so it didn't add a new debug header.  It did, however,
replace the original classification header with the new one, which means the
classification and debug headers I posted came from two different runs.
(Maybe all X-Spambayes headers should be deleted by default?)
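
For what it's worth, here is a minimal sketch of what deleting all
X-Spambayes headers before re-filtering might look like.  It uses only the
standard email package; the function name strip_spambayes_headers is made up
for illustration and is not part of hammiefilter or the spambayes code.

```python
from email import message_from_string

def strip_spambayes_headers(msg):
    """Remove every header whose name starts with X-Spambayes.

    Hypothetical helper, not part of spambayes itself.
    """
    # Iterate over a copy of the names, since we mutate msg while looping.
    for name in set(msg.keys()):
        if name.lower().startswith("x-spambayes"):
            del msg[name]  # deletes all occurrences of that header name
    return msg

raw = (
    "X-Spambayes-Classification: unsure; 0.46\n"
    "X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
cleaned = strip_spambayes_headers(message_from_string(raw))
print(cleaned.keys())
```

With stale headers stripped first, a run that adds only a classification
header can't leave a mismatched debug header behind.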

Here's what the Spambayes headers for that message look like now:

  X-Spambayes-Classification: spam; 1.00
  X-Spambayes-Debug: '*H*': 0.00; '*S*': 1.00; 'charset:us-ascii': 0.17;
          'header:Message-ID:1': 0.34; 'cc:2**2': 0.62; 'header:Mime-Version:1': 0.66;
          'to:addr:bugs': 0.73; 'skip:1 10': 0.76; 'bytes/words: 2': 0.84;
          'cc:addr:bugsmoke': 0.84; 'cc:addr:bugsmom16': 0.84;
          'cc:addr:bugsmom_1982': 0.84; 'from:addr:diplomas.org': 0.84;
          'from:addr:learning': 0.84; 'from:name:marie': 0.84;
          'message-id:@hkgioexchange1.corp.giordano.com.hk': 0.84;
          'to:addr:moi.com': 0.84; 'pfxlen:2': 0.87; 'cc:no real name:2**2': 0.87;
          'cc:addr:mojam.com': 0.89; 'cc:addr:yahoo.com': 0.89;
          'header:Received:3': 0.90; 'cc:addr:msn.com': 0.96;
          'cc:addr:gateway.net': 0.97; 'cc:addr:bugs': 0.99

Note that there are also many more clues than before.  For comparison, the
headers from the original run were:

  X-Spambayes-Classification: unsure; 0.46
  X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
          'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; 'cc:2**2': 0.62;
          'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; 'header:Received:3': 0.90
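
As a quick sanity check, Tim's formula can be applied to the H and S values
in both sets of headers.  This is only a sketch: the function name
combined_score is mine, and the inputs are the two-digit rounded values from
the debug headers, so the results are approximate.

```python
def combined_score(S, H):
    """Overall score from the formula Tim quoted: (S - H + 1) / 2."""
    return (S - H + 1.0) / 2.0

# Original run: H ~= 0.56, S ~= 0.47 (rounded values from the debug header).
old_score = combined_score(S=0.47, H=0.56)

# New run: H = 0.00, S = 1.00.
new_score = combined_score(S=1.00, H=0.00)

print("original run: %.3f" % old_score)
print("new run:      %.3f" % new_score)
```

The original run comes out at roughly 0.455, which agrees with the "unsure;
0.46" classification once you allow for H and S being rounded, and the new
run comes out at 1.0.  Neither is anywhere near 0.95.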

The original run was against the spambayes software and database I have on
the Mojam web server (something I didn't notice at first either).  I think
either the database or the software there is getting a bit out-of-date.
Note that the original run has no cc:addr clues at all, and those are what
put this message squarely in the spam domain.
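
To see which clues differ between two runs, the debug headers can be parsed
and compared.  The sketch below assumes the clue format shown in the headers
above ('token': probability pairs separated by semicolons) and that tokens
contain no single quotes; the names CLUE_RE and parse_debug_header are
hypothetical, and the two header strings are abbreviated from the ones in
this message.

```python
import re

# Matches one clue of the form 'token': 0.84 (assumes no quote inside token).
CLUE_RE = re.compile(r"'([^']*)':\s*([0-9.]+)")

def parse_debug_header(value):
    """Turn an X-Spambayes-Debug value into a {token: probability} dict."""
    return {token: float(prob) for token, prob in CLUE_RE.findall(value)}

# Abbreviated versions of the two debug headers shown above.
old = parse_debug_header(
    "'*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05; "
    "'header:Received:3': 0.90"
)
new = parse_debug_header(
    "'*H*': 0.00; '*S*': 1.00; 'header:Received:3': 0.90; "
    "'cc:addr:bugs': 0.99"
)

# Clues present in the new run but missing from the original one.
only_in_new = sorted(set(new) - set(old))
print(only_in_new)
```

Run against the full headers, this kind of diff makes the missing cc:addr
clues stand out right away.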

At this point, I'm going to hold off on the bytes/words ratio stuff.  If
anyone wants to play around with it, I'll be happy to send you a context
diff for tokenize.py.

Skip