[Spambayes] Low-priority feature request

Sun May 23 01:31:40 EDT 2004

[John Byrd]
> ...
> I believe that in my spam database, there is a TON of useful information
> about how spammers have found me and how they're spamming me.  I know that
> a database import/export command exists... but my specific request...
> would be an advanced reporting command that showed the tokens in the
> database which are MOST indicative of "spam"... say the top 100 or so.

I believe the database would have to save additional info to have any hope
of giving a meaningful report about "most indicative of spam".  These aren't
necessarily the tokens with the highest spamprobs!  For most people, who do
some form of mistake-based training, the tokens with the highest spamprobs
are merely those that got *trained* on most often.

The 10 highest-spamprob tokens in my database today illustrate this nicely:

Token                                 spamprob nham  nspam
url:biz                                  0.996    0     51
bi:url:index url:php                     0.992    0     28
subject:Out                              0.991    0     25
subject:AutoReply                        0.991    0     25
bi:subject:Out subject:Office            0.991    0     25
bi:subject:Office subject:AutoReply      0.991    0     24
subject:Delivery                         0.990    0     21
received:218                             0.988    0     18
bi:for contacting                        0.988    0     18
bi:delivery (failure                     0.988    0     18

(I have bigrams enabled, which is where the "bi:" tokens come from)
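
To make the "trained on most often" point concrete, here is a rough sketch of how a spamprob could fall out of the two counts alone. It assumes a Robinson-style adjustment with strength s=0.45 and unknown-word probability x=0.5 (I believe these are the Spambayes defaults, but treat them as assumptions), and the message totals are hypothetical, since the table doesn't give them. The function name is made up for illustration.

```python
def adjusted_spamprob(nham, nspam, total_ham, total_spam, s=0.45, x=0.5):
    """Sketch of a Robinson-adjusted spam probability for one token.

    s and x are assumed values, not read from any real configuration.
    total_ham and total_spam are the number of trained messages of each kind.
    """
    n = nham + nspam
    if n == 0:
        return x  # never-seen token: fall back to the prior
    ham_ratio = nham / total_ham
    spam_ratio = nspam / total_spam
    p = spam_ratio / (ham_ratio + spam_ratio)  # raw estimate from the counts
    return (s * x + n * p) / (s + n)           # shrink toward x when n is small


# Hypothetical totals; for tokens with nham == 0 they cancel out entirely.
for nspam in (51, 28, 25, 24, 21, 18):
    print(nspam, "%.3f" % adjusted_spamprob(0, nspam, 1000, 1000))
```

Under those assumptions the loop reproduces the spamprob column above (0.996, 0.992, 0.991, 0.991, 0.990, 0.988). For a token never seen in ham, the result depends on nothing but nspam, so sorting by spamprob just sorts by how often the token got trained as spam.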

What this shows is what I already knew <wink>:  most of the Unsures I train
on as spam are autoreply or bounce kinds of messages, due to virus and spam
email forged to appear as if it came from one of the public admin and help
addresses I volunteer for, or from one of my personal addresses.  I get a
ton of these, and they're spam to me.

Mine is probably an extreme case, but I expect you'd be disappointed in
seeing your highest-spamprob words too, unless you train on everything.

At the start of this project, the database saved more info, including the
most recent time a token was *used* in scoring, and how often a token
contributed to a correct classification in scoring.  The latter in
particular is a much better measure of "more indicative of spam (or ham)":
it's the tokens that help most often in nailing new messages to a correct
classification that are the most *valuable* tokens you have.  But that can't
be deduced from what the slimmed-down database contains.
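
A minimal sketch of what that extra bookkeeping might look like follows. This is not the early Spambayes code: the names TokenUsage, record_scoring and most_valuable are hypothetical, and "contributed to a correct classification" is interpreted here (an assumption) as "the message was classified correctly and the token's spamprob leaned toward the right side of 0.5". It also assumes something eventually tells you what the message really was.

```python
import time
from dataclasses import dataclass


@dataclass
class TokenUsage:
    last_used: float = 0.0  # timestamp of the most recent use in scoring
    correct_hits: int = 0   # times the token leaned the right way in a correct call


def record_scoring(usage, clues, predicted_spam, actually_spam, now=None):
    """Update per-token usage after one message has been scored.

    usage: dict mapping token -> TokenUsage (updated in place).
    clues: iterable of (token, spamprob) pairs that were used in scoring.
    """
    now = time.time() if now is None else now
    correct = predicted_spam == actually_spam
    for token, prob in clues:
        stats = usage.setdefault(token, TokenUsage())
        stats.last_used = now
        # Credit the token only if the call was right and it leaned that way.
        if correct and prob != 0.5 and (prob > 0.5) == actually_spam:
            stats.correct_hits += 1


def most_valuable(usage, n=100):
    """Return the n tokens with the most correct-classification credits."""
    ranked = sorted(usage.items(), key=lambda kv: kv[1].correct_hits, reverse=True)
    return [(token, stats.correct_hits) for token, stats in ranked[:n]]
```

A report built on most_valuable() would rank tokens by how often they actually helped, rather than by how often they were trained on, which is the distinction drawn above.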

BTW, I have 5,397 tokens with spamprob > .9 of 211,736 total right now.
There are 8,934 with spamprob < .1.  Some of the 10 lowest surprised me:

Token                                          spamprob nham nspam
bi:received:127.0.0.1 message-id:@python.org      0.012  18   0
zope                                              0.012  18   0
bi:header:User-Agent:1 header:Errors-To:1         0.013  83   1
bi:header:X-Complaints-To:1 header:Mime-Version:1 0.013  17   0
bi:header:From:1 header:Return-path:1             0.015  15   0
bi:received:172.20 received:172.20.3              0.015  15   0
header:Return-path:1                              0.015  15   0
received:172.20.3                                 0.015  15   0
bi:header:X-Complaints-To:1 header:MIME-Version:1 0.016  14   0
from:addr:zope.com                                0.016  14   0

One surprise there is that almost all of them came from the headers.