[Spambayes] Confused by tokens
Tim Peters
tim.one at comcast.net
Mon Mar 3 20:11:21 EST 2003
[Mark Hammond]
> *sigh* - I am wallowing in confusion today - reeling from
> bug-to-bug trying to keep my eye on the ball as I go.
>
> While looking into the previous "missing HTML payload" problem, I
> discovered two issues:
>
> 1) Outlook's incremental training is *definitely* broken. Unfortunately,
> not in an obvious way. It is possible to get hapaxes showing up in the
> wrong category, or showing up multiple times. Eg, I have confirmed that:
>
> 'url:vivapharmacy1' 0.155172 1 0
>
> is a hapax unique to this spam. However, I have seen this
> occasionally with a "2" in the ham column, a "1" in each of "ham" and
> "spam", and as above "1" in ham even though the most recent operation
> was a "train as spam". Simple tests show that it works OK, so there is
> something subtle going on. I'm trying to track this down.
I haven't seen this, but I haven't updated my spambayes directory in at
least a month (ain't broke, why fix <wink>).
> 2) The point of this mail - I am confused by our tokens. Again,
> if we look at the clues for this message, we see:
> 'url:vivapharmacy1' 0.155172 1 0
That clue must have come from the body of the msg. I note that *all* the
tokens you show next came from the headers:
> But the 'all tokens' list consists of:
> """
> 23 unique tokens
>
> header:Importance:1
> subject:Following
> from:addr:yahoo.com
> message-id:@atbsfwo.wvk
> header:From:1
> from:addr:domresgube
> header:MIME-Version:1
> x-mailer:microsoft outlook express 5.50.4522.1200
> header:Subject:1
> to:2**0
> header:Received:9
> subject:133
> subject:2120uBwJ9
> subject:
> subject::
> header:To:1
> subject:-
> from:no real name:2**0
> content-type:multipart/mixed
> header:Return-Path:1
> header:Date:1
> header:Message-ID:1
> subject:
> """
>
> ie, that token is not listed
There are no body tokens here at all. I don't expect that to be obvious to
anyone, I just happen to know that all those prefix tags ("header:",
"subject:", etc) come from tokenize_headers() (as opposed to
tokenize_body(), from which "url:"-tagged tokens come).
> (and strangely 'subject:' is listed twice).
Probably not <wink>. Tokenization of a Subject header is unique in one
respect:
    for w in punctuation_run_re.findall(x):
        yield 'subject:' + w

where

    punctuation_run_re = re.compile(r'\W+')

IOW, runs of (among other things) consecutive whitespace characters count as
tokens in a subject line, but they don't anywhere else. This made a small
but real improvement in tests at the time, likely because of spam subject
lines of the form
    Subject: Get Big Now!                    random_gibberish_here
You probably can't see the difference between:
    subject: 
and
    subject:                              
but they're distinct tokens (the first is a single blank, the second a run
of 30 blanks).
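
A minimal sketch of that effect, assuming nothing beyond the regexp quoted above. The subject string is made up, the helper name is hypothetical, and the real tokenizer also emits word tokens for a Subject header, which this sketch leaves out:

```python
import re

# Same pattern as quoted above: one or more consecutive non-word characters.
punctuation_run_re = re.compile(r'\W+')

def subject_punctuation_tokens(subject):
    """Yield one 'subject:'-tagged token per run of non-word characters.

    Hypothetical helper for illustration; it covers only the
    punctuation-run part of Subject tokenization.
    """
    for w in punctuation_run_re.findall(subject):
        yield 'subject:' + w

# Made-up subject: single blanks between the words, then a run of 30 blanks.
subject = "Get Big Now" + " " * 30 + "gibberish42"

# Collapsing to a set drops the repeated single-blank token.
tokens = set(subject_punctuation_tokens(subject))

# repr() makes the otherwise invisible whitespace runs visible.
for tok in sorted(tokens, key=len):
    print(repr(tok))
```

Printing with repr() rather than dumping the raw token is what makes the two "subject:" entries distinguishable in a listing.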
> The code to dump the tokens is:
>
> from spambayes.tokenizer import tokenize
> from spambayes.classifier import Set # whatever classifier uses
> push("<h2>Message Tokens:</h2><br>")
> toks = Set(tokenize(msg))
> push("%d unique tokens<br>" % (len(toks),))
You could write that
push("%d unique tokens<br>" % len(toks))
> push("<PRE>")
> for token in toks:
>     push(escape(token) + "\n")
> push("</PRE>")
>
> 'push' is list.append, 'escape' is cgi.escape, and 'msg' is an 'email'
> package object.
>
> I am confused where our tokens came from, and why no 'url:'
> tokens appear in the list of all tokens, even though they do appear
> in the clues list.
I can only guess that msg only contained headers in this case, or that
damaged MIME structure in the body caused the email pkg to give up in a way
the tokenizer didn't recover from. But then I wonder how we *ever* got a
url: token out of the body.
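
One way to check the first guess is to look at what the email package actually parsed out of the message. This is only a sketch against the current standard-library email package, which may behave differently from the version in use here when it meets damaged MIME; describe_parts and the two sample messages are hypothetical, not anything from spambayes:

```python
import email

def describe_parts(raw_text):
    """Return (content_type, payload_length, defect_names) per leaf part.

    Hypothetical diagnostic helper: a message with no leaf part of
    nonzero length has nothing for a body tokenizer to work on.
    """
    msg = email.message_from_string(raw_text)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            # Containers carry no text themselves; their children follow.
            continue
        payload = part.get_payload() or ""
        defects = [type(d).__name__ for d in part.defects]
        parts.append((part.get_content_type(), len(payload), defects))
    return parts

# Made-up messages: one with headers only, one with a URL in the body.
headers_only = "From: a@example.com\nSubject: hi\n\n"
with_body = "From: a@example.com\nSubject: hi\n\nhttp://example.com/\n"

print(describe_parts(headers_only))
print(describe_parts(with_body))
```

If every part comes back with length 0, or the defect list is non-empty, that would point at the message (or its parsing) rather than at the tokenizer.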
> One-of-those-days ly,
Indeed it is <wink>.