[Spambayes] Confused by tokens
Mark Hammond
mhammond at skippinet.com.au
Tue Mar 4 10:38:13 EST 2003
*sigh* - I am wallowing in confusion today - reeling from bug-to-bug trying
to keep my eye on the ball as I go.
While looking into the previous "missing HTML payload" problem, I discovered
two issues:
1) Outlook's incremental training is *definitely* broken. Unfortunately,
not in an obvious way. It is possible to get hapaxes showing up in the
wrong category, or showing up multiple times. Eg, I have confirmed that:
'url:vivapharmacy1' 0.155172 1 0
is a hapax unique to this spam. However, I have seen this occasionally with
a "2" in the ham column, a "1" in each of "ham" and "spam", and as above "1"
in ham even though the most recent operation was a "train as spam". Simple
tests show that it works OK, so there is something subtle going on. I'm
trying to track this down.
2) The point of this mail - I am confused by our tokens. Again, if we look
at the clues for this message, we see:
'url:vivapharmacy1' 0.155172 1 0
But the 'all tokens' list consists of:
"""
23 unique tokens
header:Importance:1
subject:Following
from:addr:yahoo.com
message-id:@atbsfwo.wvk
header:From:1
from:addr:domresgube
header:MIME-Version:1
x-mailer:microsoft outlook express 5.50.4522.1200
header:Subject:1
to:2**0
header:Received:9
subject:133
subject:2120uBwJ9
subject:
subject::
header:To:1
subject:-
from:no real name:2**0
content-type:multipart/mixed
header:Return-Path:1
header:Date:1
header:Message-ID:1
subject:
"""
ie, that token is not listed (and strangely 'subject:' is listed twice).
The code to dump the tokens is:
from spambayes.tokenizer import tokenize
from spambayes.classifier import Set # whatever classifier uses
push("<h2>Message Tokens:</h2><br>")
toks = Set(tokenize(msg))
push("%d unique tokens<br>" % (len(toks),))
push("<PRE>")
for token in toks:
    push(escape(token) + "\n")
push("</PRE>")
'push' is list.append, 'escape' is cgi.escape, and 'msg' is an 'email'
package object.
I am confused where our tokens came from, and why no 'url:' tokens appear in
the list of all tokens, even though they do appear in the clues list.
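A set cannot hold two equal elements, so the two 'subject:' entries in the dump must differ in some character the plain listing does not show. A minimal sketch of the same dump, printing repr() of each token, would make such differences visible. This is not the SpamBayes code: dump_tokens and the sample token list are made up for illustration, the built-in set stands in for Set, and html.escape stands in for cgi.escape.

```python
# cgi.escape was removed in Python 3.8; html.escape is its replacement.
from html import escape


def dump_tokens(tokens):
    """Return HTML fragments listing each unique token via repr().

    Hypothetical helper: 'tokens' is any iterable of strings, standing in
    for the output of the real tokenizer.
    """
    unique = set(tokens)
    out = []
    push = out.append
    push("<h2>Message Tokens:</h2><br>")
    push("%d unique tokens<br>" % len(unique))
    push("<PRE>")
    # repr() exposes trailing whitespace and control characters;
    # sorting puts look-alike tokens next to each other.
    for token in sorted(unique):
        # quote=False matches the default behaviour of the old cgi.escape
        push(escape(repr(token), quote=False) + "\n")
    push("</PRE>")
    return out


# Made-up token stream: two tokens that look identical when printed bare.
sample = ["subject:", "subject: ", "subject:-", "subject:"]
lines = dump_tokens(sample)
print("".join(lines))
```

With the sample above, 'subject:' and 'subject: ' come out as separate entries, and the exact duplicate is collapsed by the set.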
One-of-those-days ly,
Mark.