[Spambayes] Confused by tokens

Tim Peters tim.one at comcast.net
Mon Mar 3 20:11:21 EST 2003


[Mark Hammond]
> *sigh* - I am wallowing in confusion today - reeling from
> bug-to-bug trying to keep my eye on the ball as I go.
>
> While looking into the previous "missing HTML payload" problem, I
> discovered two issues:
>
> 1) Outlook's incremental training is *definitely* broken.  Unfortunately,
> not in an obvious way.  It is possible to get hapaxes showing up in the
> wrong category, or showing up multiple times.  Eg, I have confirmed that:
>
> 'url:vivapharmacy1'                 0.155172            1      0
>
> is a hapax unique to this spam.  However, I have seen this
> occasionally with a "2" in the ham column, a "1" in each of "ham" and
> "spam", and as above "1" in ham even though the most recent operation
> was a "train as spam".  Simple tests show that it works OK, so there is
> something subtle going on.  I'm trying to track this down.

I haven't seen this, but I haven't updated my spambayes directory in at
least a month (ain't broke, why fix <wink>).
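
If it would help to rule out the classifier proper, a throwaway sanity check
is cheap.  This is only a sketch, not the addin's code path: I'm assuming the
in-memory classifier class (I'll call it Classifier here), its learn() method,
and a wordinfo dict of WordInfo objects carrying hamcount/spamcount
attributes; the Outlook addin's persistent store may spell those differently,
tokenize() is assumed to accept a raw message string, and the message and
addresses below are made up:

    from spambayes.tokenizer import tokenize
    from spambayes.classifier import Classifier

    spam_text = ("From: spammer@example.com\n"
                 "Subject: Following\n"
                 "\n"
                 "http://vivapharmacy1.example.com/\n")

    bayes = Classifier()
    bayes.learn(tokenize(spam_text), True)      # train once, as spam

    info = bayes.wordinfo.get('url:vivapharmacy1')
    if info is None:
        print "token never recorded"
    else:
        # After a single train-as-spam this should read ham=0 spam=1.
        print "ham=%d spam=%d" % (info.hamcount, info.spamcount)

If that checks out, the suspect is the addin's incremental bookkeeping rather
than learn() itself.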

> 2) The point of this mail - I am confused by our tokens.  Again,
> if we look at the clues for this message, we see:
> 'url:vivapharmacy1'                 0.155172            1      0

That clue must have come from the body of the msg.  I note that *all* the
tokens you show next came from the headers:

> But the 'all tokens' list consists of:
> """
> 23 unique tokens
>
> header:Importance:1
> subject:Following
> from:addr:yahoo.com
> message-id:@atbsfwo.wvk
> header:From:1
> from:addr:domresgube
> header:MIME-Version:1
> x-mailer:microsoft outlook express 5.50.4522.1200
> header:Subject:1
> to:2**0
> header:Received:9
> subject:133
> subject:2120uBwJ9
> subject:
> subject::
> header:To:1
> subject:-
> from:no real name:2**0
> content-type:multipart/mixed
> header:Return-Path:1
> header:Date:1
> header:Message-ID:1
> subject:
> """
>
> ie, that token is not listed

There are no body tokens here at all.  I don't expect that to be obvious to
anyone, I just happen to know that all those prefix tags ("header:",
"subject:", etc) come from tokenize_headers() (as opposed to
tokenize_body(), from which "url:"-tagged tokens come).
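
If you want to see at a glance which side of the tokenizer a clue came from,
something like this works.  A sketch only: I'm assuming the Tokenizer class
still exposes tokenize_headers() and tokenize_body() as separate generators
over an email message (as in tokenizer.py), and that get_message() is still
the string-or-Message coercion helper:

    from spambayes.tokenizer import Tokenizer, get_message

    def split_tokens(msg):
        """Return (header_tokens, body_tokens) as two sets."""
        t = Tokenizer()
        msg = get_message(msg)      # accepts a raw string or a Message
        return set(t.tokenize_headers(msg)), set(t.tokenize_body(msg))

    # With the same msg object your dumper gets:
    #     header_toks, body_toks = split_tokens(msg)
    #     print "%d header, %d body" % (len(header_toks), len(body_toks))
    #     print [tok for tok in body_toks if tok.startswith('url:')]

If body_toks comes back empty for this message, the problem is upstream of
the classifier.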

> (and strangely 'subject:' is listed twice).

Probably not <wink>.  Tokenization of a Subject header is unique in one
respect:

            for w in punctuation_run_re.findall(x):
                yield 'subject:' + w

where

     punctuation_run_re = re.compile(r'\W+')

IOW, runs of (among other things) consecutive whitespace characters count as
tokens in a subject line, but they don't anywhere else.  This made a small
but real improvement in tests at the time, likely because of spam subject
lines of the form

Subject: Get Big Now!                              random_gibberish_here

You probably can't see the difference between:

    subject:

and

    subject:

but they're distinct tokens (the first is a single blank, the second a run
of 30 blanks).
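
repr() makes the difference visible.  This is just the punctuation-run loop
pulled out on its own (the real tokenizer also yields the subject's word
tokens separately):

    import re

    punctuation_run_re = re.compile(r'\W+')

    subject = "Get Big" + " " * 30 + "Now"
    for w in punctuation_run_re.findall(subject):
        print repr('subject:' + w)

    # 'subject: '                 <- the single blank between "Get" and "Big"
    # 'subject:<30 blanks>'       <- the run of 30 blanks before "Now"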

> The code to dump the tokens is:
>
>         from spambayes.tokenizer import tokenize
>         from spambayes.classifier import Set # whatever classifier uses
>         push("<h2>Message Tokens:</h2><br>")
>         toks = Set(tokenize(msg))
>         push("%d unique tokens<br>" % (len(toks),))

You could write that

          push("%d unique tokens<br>" % len(toks))

>         push("<PRE>")
>         for token in toks:
>             push(escape(token) + "\n")
>         push("</PRE>")
>
> 'push' is list.append, 'escape' is cgi.escape, and 'msg' is an 'email'
> package object.
>
> I am confused where our tokens came from, and why no 'url:'
> tokens appear in the list of all tokens, even though they do appear
> in the clues list.

I can only guess that msg only contained headers in this case, or that
damaged MIME structure in the body caused the email pkg to give up in a way
the tokenizer didn't recover from.  But then I wonder how we *ever* got a
url: token out of the body.
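
To check that, dumping what the email package actually handed over for the
body would settle it.  A sketch over the stdlib email API, using the same msg
object your dumper gets:

    def describe_payload(msg):
        """Print each MIME part's type and how many payload bytes it has."""
        for part in msg.walk():
            if part.is_multipart():
                print "%s (container)" % part.get_content_type()
            else:
                payload = part.get_payload(decode=True) or ""
                print "%s: %d payload bytes" % (part.get_content_type(),
                                                len(payload))

A headers-only message shows up as a single part with 0 payload bytes; a
damaged multipart/mixed might show up as a container whose sub-parts are
missing or empty.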

> One-of-those-days ly,

Indeed it is <wink>.



