[Spambayes] Run filter and only return a report???
Tony Meyer
tameyer at ihug.co.nz
Thu Feb 10 23:32:32 CET 2005
> I'd probably only be interested in the tokens that were used
> in scoring and the output just needs to be in an easily parseable format.
If you do something like this:
python scripts/sb_filter.py -d hammie.db -o Headers:include_evidence:True < msg.txt
You'll get a header in the message copy output to stdout that looks like
this:
X-Spambayes-Evidence: '*H*': 1.00; '*S*': 0.00; 'shows': 0.06; 'cheers': 0.09;
'finished': 0.09; 'show.': 0.09; 'channel': 0.09; 'url:nz': 0.10;
'school,': 0.13; 'enough': 0.13; 'space': 0.13; 'acting': 0.16;
'there,': 0.16; 'trailers': 0.16; 'xtra': 0.16; 'yeah,': 0.16;
"year's": 0.16; 'broadband': 0.17; 'year': 0.18; 'done': 0.18;
'movie': 0.19; 'keen': 0.20; 'url:co': 0.20; "i've": 0.21;
"you're": 0.21; 'getting': 0.21; 'high': 0.23; 'next': 0.23;
'find': 0.24; 'just': 0.25; 'first': 0.25; 'let': 0.26;
'but': 0.27; 'when': 0.27; 'couple': 0.28; 'know': 0.28;
'really': 0.29; 'like': 0.30; "don't": 0.30; 'online': 0.31;
'header:Mime-Version:1': 0.32; 'watch': 0.32; 'please': 0.33;
"i'm": 0.33; 'message-id:@hotmail.com': 0.34; 'with': 0.38;
'header:Return-path:1': 0.40; "subject:'": 0.66;
'from:addr:hotmail.com': 0.69; 'header:Received:4': 0.72;
'to:addr:madsods.gen.nz': 0.83; 'ellis': 0.84;
'subject:show': 0.84; 'subject:year': 0.84; 'skip:_ 60': 0.91
These are just the tokens that are used ('*H*' and '*S*' are special
internal tokens that represent the individual ham and spam scores; you
probably want to ignore those). Parsing that would be reasonably simple.
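As a rough illustration of that parsing, here is a minimal sketch. It assumes the header value is a "; "-separated list of Python-repr tokens, each followed by a colon and a number, as in the sample above. The function name parse_evidence and the cut-down sample message are made up for the example and are not part of SpamBayes.

```python
import ast
import email
import re

# One entry looks like  'token': 0.09  or  "i've": 0.21  (tokens are Python reprs).
# The optional u prefix allows for unicode reprs; that is an assumption.
_ENTRY = re.compile(
    r"""(u?'(?:[^'\\]|\\.)*'|u?"(?:[^"\\]|\\.)*")\s*:\s*([0-9.eE+-]+)""")

def parse_evidence(value, skip_internal=True):
    """Return a list of (token, spamprob) pairs from an evidence header value."""
    clues = []
    for quoted, prob in _ENTRY.findall(value):
        token = ast.literal_eval(quoted)  # turn the repr back into a string
        if skip_internal and token in ("*H*", "*S*"):
            continue  # internal ham/spam scores, not real tokens
        clues.append((token, float(prob)))
    return clues

# A shortened, hand-written stand-in for the filter's output on stdout.
raw = ("X-Spambayes-Evidence: '*H*': 1.00; '*S*': 0.00; 'shows': 0.06; 'cheers':\n"
       " 0.09;\n"
       " \"i've\": 0.21; 'skip:_ 60': 0.91\n"
       "Subject: show\n"
       "\n"
       "body\n")
msg = email.message_from_string(raw)
evidence = parse_evidence(msg["X-Spambayes-Evidence"])
for token, prob in evidence:
    print("%-12s %.2f" % (token, prob))
```

The regular expression works on the folded header as-is, so the line breaks inside the header do not need to be removed first.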
> Right, just give me a score, don't make any changes to the
> database or attempt to deliver the message.
Running the above command follows those rules.
> Thanks. I need to add Python to the list of programming
> languages I know.
It only takes a day <wink>.
> Basically, a friend who's company uses SpamBayes with the Outlook
> plug-in sent me a report he saw, here is a summary:
>
> Combined Score: 100% (0.999998)
> Internal ham score (*H*): 4.79832e-006
> Internal spam score (*S*): 1
>
> # ham trained on: 89
> # spam trained on: 1733
> 28 Significant Tokens
>
> token spamprob #ham #spam
> 'x-mailer:microsoft office outlook, build 11.0.6353' 0.168914 2 7
> 'url:org' 0.254701 13 86
> 'url:rec-html40' 0.277582 3 22
> 'skip:r 10' 0.284156 28 216
> 'skip:p 10' 0.321735 31 286
> 'url:tr' 0.372452 4 46
> 'url:www' 0.384768 63 767
> 'virus:src="cid:' 0.72041 3 151
> 'from:addr:level3.net' 0.844828 0 1
> 'subject:\xe4' 0.844828 0 1
> .
> .
>
> That's basically the kind of report I would like to see.
Ok, I've ripped out the code from the Outlook plug-in that does this and
converted it to a command-line script (attached). Run it something like:
python showclues.py -d hammie.db < msg.txt
It does output in HTML at the moment, because that's what the Outlook
plug-in does (for an Outlook-specific reason). It would be simple enough to
strip the HTML out of the script, though (I imagine even without knowledge
of Python). If you'd like that done, I don't mind doing it (this script
seems potentially useful enough for me to check it into the contrib/
directory). Let me know if there are any other improvements you can think
of.
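To give an idea of what the HTML-free output could look like, here is a minimal sketch of a plain-text formatter. It is not the attached script: format_report is a hypothetical name, and the counts dictionary is a stand-in for the classifier's word records so that the example runs without a SpamBayes database. The clues list is assumed to have the same shape as in the script, (token, probability) pairs with the internal ham and spam scores first.

```python
def format_report(score, clues, nham, nspam, counts):
    """Build a plain-text report from a score and a list of (token, prob) clues.

    counts maps token -> (hamcount, spamcount); it stands in for the
    classifier's word records (an assumption made for this sketch).
    """
    clues = list(clues)  # work on a copy, since we pop from it
    lines = ["Combined Score: %d%% (%g)" % (round(score * 100), score)]
    # The first two clues are the internal ham and spam scores.
    lines.append("Internal ham score (%s): %g" % clues.pop(0))
    lines.append("Internal spam score (%s): %g" % clues.pop(0))
    lines.append("")
    lines.append("# ham trained on: %d" % nham)
    lines.append("# spam trained on: %d" % nspam)
    lines.append("%d Significant Tokens" % len(clues))
    lines.append("")
    lines.append("%-35s %-12s %8s %6s" % ("token", "spamprob", "#ham", "#spam"))
    for token, prob in clues:
        # Tokens without a record get "-" in both count columns.
        ham, spam = counts.get(token, ("-", "-"))
        lines.append("%-35s %-12g %8s %6s" % (repr(token), prob, ham, spam))
    return "\n".join(lines)

# Values taken from the report quoted earlier in this message.
report = format_report(
    0.999998,
    [("*H*", 4.79832e-06), ("*S*", 1.0), ("url:org", 0.254701)],
    89, 1733,
    {"url:org": (13, 86)},
)
print(report)
```

Whitespace-separated columns like these should be easy to split again in another language, although a token that itself contains a space would need a little more care.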
=Tony.Meyer
--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
-------------- next part --------------
# Python 2 script: print an HTML clue report for each message read.
import cgi
import sys
import getopt

from spambayes import storage
from spambayes import mboxutils
from spambayes.classifier import Set
from spambayes.Options import options
from spambayes.tokenizer import tokenize

def ShowClues(bayes, msg):
    """Return an HTML report of the score and clues for one message."""
    score, clues = bayes.spamprob(tokenize(msg), evidence=True)
    body = ["<h2>Combined Score: %d%% (%g)</h2>\n" %
            (round(score*100), score)]
    push = body.append
    # Format internal scores.
    push("Internal ham score (<tt>%s</tt>): %g<br>\n" % clues.pop(0))
    push("Internal spam score (<tt>%s</tt>): %g<br>\n" % clues.pop(0))
    # Format the # ham and spam trained on.
    push("<br>\n")
    push("# ham trained on: %d<br>\n" % bayes.nham)
    push("# spam trained on: %d<br>\n" % bayes.nspam)
    push("<br>\n")
    # Format the clues.
    push("<h2>%s Significant Tokens</h2>\n<PRE>" % len(clues))
    push("<strong>")
    # Column widths match the padding used for each row below.
    push("%-35s %-12s %8s %6s\n" % ("token", "spamprob", "#ham", "#spam"))
    push("</strong>\n")
    format = " %-12g %8s %6s\n"
    fetchword = bayes.wordinfo.get
    for word, prob in clues:
        record = fetchword(word)
        if record:
            nham = record.hamcount
            nspam = record.spamcount
        else:
            nham = nspam = "-"
        word = repr(word)
        push(cgi.escape(word) + " " * (35-len(word)))
        push(format % (prob, nham, nspam))
    push("</PRE>\n")
    # Now the raw text of the message
    push("<h2>Message Stream</h2>\n")
    push("<PRE>\n")
    push(cgi.escape(msg.as_string()))
    push("</PRE>\n")
    # Show all the tokens in the message
    push("<h2>All Message Tokens</h2>\n")
    # need to re-fetch, as the tokens we see may be different based on
    # header stripping.
    toks = Set(tokenize(msg))
    # create a sorted list
    toks = list(toks)
    toks.sort()
    push("%d unique tokens<br><br>" % len(toks))
    # Use <code> instead of <pre>, as <pre> is not word-wrapped by IE
    # Tokens are escaped, as they may contain HTML markup characters.
    # could use pprint, but not worth it.
    for token in toks:
        push("<code>" + cgi.escape(repr(token)) + "</code><br>\n")
    # Put the body together, then the rest of the message.
    body = ''.join(body)
    body = """\
<HTML>
<HEAD>
<STYLE>
    h2 {color: green}
</STYLE>
</HEAD>
<BODY>""" + body + "</BODY></HTML>"
    return body

if __name__ == "__main__":
    opts, args = getopt.getopt(sys.argv[1:], 'd:p:o:')
    for opt, arg in opts:
        if opt in ('-o', '--option'):
            options.set_from_cmdline(arg, sys.stderr)
    dbname, usedb = storage.database_type(opts)
    bayes = storage.open_storage(dbname, usedb)
    bayes.load()
    # With no file arguments, read the message from stdin.
    if not args:
        args = ["-"]
    for fname in args:
        mbox = mboxutils.getmbox(fname)
        for msg in mbox:
            print ShowClues(bayes, msg)