[Spambayes] ok, i'm confused
Skip Montanaro
skip at pobox.com
Fri Mar 7 17:19:25 EST 2003
Tim> I removed that part, in order to make an internal inconsistency
Tim> clearer: the overall score is
Tim> prob = (S-H + 1.0) / 2.0
Tim> and 0.95 simply doesn't make any sense with H ~= 0.56 and S ~=
Tim> 0.47.
Problem solved. The message had already been run through spambayes once, so
it already had X-Spambayes-Classification and X-Spambayes-Debug headers.
The second time I ran it through hammiefilter manually I forgot to set
BAYESCUSTOMIZE, so it didn't add a new debug header. It did, however,
replace the original classification header with the new one. (Maybe all
X-Spambayes headers should be deleted by default?)
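A minimal sketch of that idea, assuming the message is available as a string and using only Python's standard email package. The helper name strip_spambayes_headers is hypothetical and is not part of hammiefilter:

```python
from email import message_from_string


def strip_spambayes_headers(raw_message, prefix="x-spambayes-"):
    """Return the message text with every X-Spambayes-* header removed.

    Hypothetical helper for illustration; not Spambayes code.
    """
    msg = message_from_string(raw_message)
    # Copy the key list first, since we delete while iterating.
    for name in list(msg.keys()):
        if name.lower().startswith(prefix):
            # Deletes all occurrences; no error if already removed.
            del msg[name]
    return msg.as_string()


raw = (
    "From: a@example.com\n"
    "X-Spambayes-Classification: unsure; 0.46\n"
    "X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
cleaned = strip_spambayes_headers(raw)
```

Run before filtering, something like this would keep a stale debug header from surviving next to a fresh classification header.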
Here's what the Spambayes headers for that message look like now:
X-Spambayes-Classification: spam; 1.00
X-Spambayes-Debug: '*H*': 0.00; '*S*': 1.00; 'charset:us-ascii': 0.17;
'header:Message-ID:1': 0.34; 'cc:2**2': 0.62; 'header:Mime-Version:1': 0.66;
'to:addr:bugs': 0.73; 'skip:1 10': 0.76; 'bytes/words: 2': 0.84;
'cc:addr:bugsmoke': 0.84; 'cc:addr:bugsmom16': 0.84;
'cc:addr:bugsmom_1982': 0.84; 'from:addr:diplomas.org': 0.84;
'from:addr:learning': 0.84; 'from:name:marie': 0.84;
'message-id:@hkgioexchange1.corp.giordano.com.hk': 0.84;
'to:addr:moi.com': 0.84; 'pfxlen:2': 0.87; 'cc:no real name:2**2': 0.87;
'cc:addr:mojam.com': 0.89; 'cc:addr:yahoo.com': 0.89;
'header:Received:3': 0.90; 'cc:addr:msn.com': 0.96;
'cc:addr:gateway.net': 0.97; 'cc:addr:bugs': 0.99
Note that there are also many more clues than before. For comparison, the
original headers were:
X-Spambayes-Classification: unsure; 0.46
X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; 'cc:2**2': 0.62;
'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; 'header:Received:3': 0.90
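As a sanity check, the formula Tim quoted can be applied to the *H* and *S* values from both sets of headers. This is only a sketch of that arithmetic; the function name combined_score is made up here, and the inputs are the two-decimal values shown in the debug headers, so the results are approximate:

```python
def combined_score(S, H):
    """Overall score as quoted by Tim: prob = (S-H + 1.0) / 2.0."""
    return (S - H + 1.0) / 2.0


# Original run (Mojam server): '*H*': 0.56, '*S*': 0.47
old = combined_score(S=0.47, H=0.56)

# Second run (local hammiefilter): '*H*': 0.00, '*S*': 1.00
new = combined_score(S=1.00, H=0.00)

print(f"old run: {old:.3f}")
print(f"new run: {new:.3f}")
```

The first result comes out near 0.455, in line with the "unsure; 0.46" classification, and the second is 1.0, matching "spam; 1.00". Neither pair of values gives anything like 0.95, which is the inconsistency Tim pointed out.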
The first time, the message was run against the Spambayes software and
database I have on the Mojam web server (something I didn't notice
originally either). I think either the database or the software there is
getting a bit out of date. Note the lack of cc:addr clues in that first
run; those are what put this message squarely in the spam domain.
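To see which clues one run had that the other lacked, the two debug headers can be compared mechanically. The sketch below assumes the header value is a series of 'token': probability pairs separated by semicolons, as in the headers above; the regex and the parse_clues name are mine, not Spambayes code, and the inputs are abbreviated excerpts of the two headers:

```python
import re

# Matches one "'token': 0.84" pair; assumes tokens contain no single quotes.
CLUE_RE = re.compile(r"'([^']*)':\s*([0-9.]+)")


def parse_clues(debug_value):
    """Map each token in an X-Spambayes-Debug value to its probability."""
    return {tok: float(p) for tok, p in CLUE_RE.findall(debug_value)}


# Abbreviated excerpt of the original (Mojam server) debug header.
old = parse_clues(
    "'*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05; "
    "'charset:us-ascii': 0.17; 'header:Received:3': 0.90"
)

# Abbreviated excerpt of the second (local) debug header.
new = parse_clues(
    "'*H*': 0.00; '*S*': 1.00; 'charset:us-ascii': 0.17; "
    "'header:Received:3': 0.90; 'cc:addr:msn.com': 0.96; "
    "'cc:addr:bugs': 0.99"
)

only_new = sorted(set(new) - set(old))
only_old = sorted(set(old) - set(new))
print("only in second run:", only_new)
print("only in first run:", only_old)
```

On these excerpts the cc:addr tokens show up only in the second run, and subject:none only in the first.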
At this point, I'm going to hold off on the bytes/words ratio stuff. If
anyone wants to play around with it, I'll be happy to send you a context
diff for tokenize.py.
Skip