[Spambayes] Confused by tokens

Mark Hammond mhammond at skippinet.com.au
Tue Mar 4 10:38:13 EST 2003


*sigh* - I am wallowing in confusion today - reeling from bug-to-bug trying
to keep my eye on the ball as I go.

While looking into the previous "missing HTML payload" problem, I discovered
two issues:

1) Outlook's incremental training is *definitely* broken.  Unfortunately,
not in an obvious way.  It is possible to get hapaxes showing up in the
wrong category, or showing up multiple times.  Eg, I have confirmed that:

'url:vivapharmacy1'                 0.155172            1      0

is a hapax unique to this spam.  However, I have occasionally seen it with
a "2" in the ham column, with a "1" in each of "ham" and "spam", and, as
above, with a "1" in ham even though the most recent operation was a "train
as spam".  Simple tests show that it works OK, so there is something subtle
going on.  I'm trying to track this down.
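
To make the symptoms concrete, here is a minimal sketch of per-token count
bookkeeping.  It is not the SpamBayes classifier and not a diagnosis of the
Outlook problem; the TokenCounts class and its methods are hypothetical names
used only for illustration.  It shows how a hapax ends up with each of the
counts described above when a retrain is not preceded by a matching untrain.

```python
from collections import defaultdict


class TokenCounts:
    """Hypothetical stand-in for a classifier's per-token counters."""

    def __init__(self):
        # token -> [ham_count, spam_count]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, tokens, is_spam):
        # Each unique token in a message is counted once per training.
        for token in set(tokens):
            self.counts[token][1 if is_spam else 0] += 1

    def untrain(self, tokens, is_spam):
        # Reverse a previous train() call for the same category.
        for token in set(tokens):
            self.counts[token][1 if is_spam else 0] -= 1


msg = ["url:vivapharmacy1"]

# Correct incremental retrain: untrain from ham, then train as spam.
good = TokenCounts()
good.train(msg, is_spam=False)
good.untrain(msg, is_spam=False)
good.train(msg, is_spam=True)

# Retrain as spam without untraining: "1" in each of ham and spam.
both = TokenCounts()
both.train(msg, is_spam=False)
both.train(msg, is_spam=True)

# Same message trained as ham twice: "2" in the ham column.
twice = TokenCounts()
twice.train(msg, is_spam=False)
twice.train(msg, is_spam=False)

for name, tc in (("good", good), ("both", both), ("twice", twice)):
    ham, spam = tc.counts["url:vivapharmacy1"]
    print("%-6s ham=%d spam=%d" % (name, ham, spam))
```

Only the first sequence leaves the token as a spam-only hapax; the other two
reproduce the kinds of counts seen in the clues list.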

2) The point of this mail - I am confused by our tokens.  Again, if we look
at the clues for this message, we see:
'url:vivapharmacy1'                 0.155172            1      0

But the 'all tokens' list consists of:
"""
23 unique tokens

header:Importance:1
subject:Following
from:addr:yahoo.com
message-id:@atbsfwo.wvk
header:From:1
from:addr:domresgube
header:MIME-Version:1
x-mailer:microsoft outlook express 5.50.4522.1200
header:Subject:1
to:2**0
header:Received:9
subject:133
subject:2120uBwJ9
subject:
subject::
header:To:1
subject:-
from:no real name:2**0
content-type:multipart/mixed
header:Return-Path:1
header:Date:1
header:Message-ID:1
subject:
"""

ie, that token is not listed (and strangely 'subject:' is listed twice).
The code to dump the tokens is:

        from spambayes.tokenizer import tokenize
        from spambayes.classifier import Set # whatever classifier uses
        push("<h2>Message Tokens:</h2><br>")
        toks = Set(tokenize(msg))
        push("%d unique tokens<br>" % (len(toks),))
        push("<PRE>")
        for token in toks:
            push(escape(token) + "\n")
        push("</PRE>")

'push' is list.append, 'escape' is cgi.escape, and 'msg' is an 'email'
package object.
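
One way to investigate apparent duplicates such as the two 'subject:' lines
is to dump repr(token) instead of the raw token, so that whitespace or
non-printing characters become visible.  The sketch below assumes modern
Python 3: html.escape stands in for cgi.escape (removed in Python 3.8), the
built-in set replaces Set, and dump_tokens and the sample tokens are
hypothetical.  Whether the duplicates above really differ in this way is an
assumption, not something confirmed here.

```python
from html import escape


def dump_tokens(tokens):
    """Return HTML lines listing each unique token via repr()."""
    unique = sorted(set(tokens))
    lines = ["%d unique tokens<br>" % len(unique), "<PRE>"]
    for token in unique:
        # repr() exposes characters that look identical or invisible
        # once rendered; quote=False keeps the repr quotes readable.
        lines.append(escape(repr(token), quote=False))
    lines.append("</PRE>")
    return lines


# Hypothetical sample: two tokens that render alike but differ by a
# trailing non-breaking space, plus a true duplicate.
sample = ["subject:", "subject:\xa0", "subject:", "header:To:1"]

for line in dump_tokens(sample):
    print(line)
```

If the two entries are genuinely different strings, the repr output will
show the difference; if they are identical, the set would have collapsed
them and the problem lies elsewhere.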

I am confused where our tokens came from, and why no 'url:' tokens appear in
the list of all tokens, even though they do appear in the clues list.

One-of-those-days ly,

Mark.



