[spambayes-dev] correlated clues

Toby Dickenson tdickenson at geminidataloggers.com
Thu Jul 1 08:08:08 EDT 2004


On Thursday 01 July 2004 01:00, Tim Peters wrote:

> We have two anti-bad-correlation gimmicks now, driven by early testing
> results, and rationalized after the fact <wink>:
>
> 1. As mentioned last time, ignoring most header lines.  If we didn't,
> virtually all spam on mailing lists would score unsure or FN (thanks
> to a large number of distinct but correlated "I came from a mailing
> list" header tokens).

Thanks for the reminder of this hack. It was the hint I needed to push this
idea into an overall win.

> Maybe another pure but personalized hack would be to add a list of
> specific tokens you want the classifier to pretend didn't exist. 

That's exactly where I was digging. I have a small database of list-id (etc.)
headers. If one of those headers is present, I insert a list-id token and
inhibit all the tokens in a list-dependent set.
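
A minimal sketch of the mechanism described above, not the actual code. The
names INHIBITED_BY_LIST and filter_tokens, the "list-id:" token spelling, and
the list-id value "zope-dev.zope.org" are all assumptions made for
illustration; none of them is spambayes API.

```python
# Hypothetical database: list-id value -> tokens to hide for that list.
# The key and the "list-id:" prefix below are illustrative assumptions.
INHIBITED_BY_LIST = {
    "zope-dev.zope.org": {"zope-dev", "url:zope", "subject:Zope"},
}


def filter_tokens(list_id, tokens):
    """Yield tokens, hiding list-dependent ones when list_id is known."""
    inhibited = INHIBITED_BY_LIST.get(list_id)
    if inhibited is None:
        # No recognised list header: pass every token through unchanged.
        yield from tokens
        return
    # A single token stands in for all the correlated list clues.
    yield "list-id:" + list_id
    for tok in tokens:
        if tok not in inhibited:
            yield tok
```

With this sketch, a zope-dev message contributes one list clue instead of many
correlated ones, while mail from unknown sources is tokenized as before.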

I started generating this token set by finding all tokens that are common to
all messages on each list. That was a dramatic loss. Almost every email
contains a 'subject' header, so once a large proportion of my ham had that
token inhibited, the presence of the header became a strong spam clue.
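
The first attempt, "all tokens common to all messages on a list", amounts to
a set intersection. A rough sketch, assuming each message is already available
as an iterable of tokens; common_tokens is a hypothetical helper name.

```python
def common_tokens(messages):
    """Return the set of tokens that appear in every message.

    messages is an iterable of token iterables, one per message.
    """
    common = None
    for tokens in messages:
        toks = set(tokens)
        # The first message seeds the set; later ones can only shrink it.
        common = toks if common is None else common & toks
    return common if common is not None else set()
```

Because nearly every message has the same basic headers, an intersection like
this picks up header-presence tokens as well as the genuinely list-specific
ones, which is the problem described above.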

Removing all 'header' clues from the set of inhibited tokens makes this an
overall win for me. The final set of inhibited tokens for zope-dev is listed
below the results. normal8 is without this hack, common7 with. I will polish
this code enough for repeatable testing, and commit it on a branch tonight.
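
A sketch of the fix, under the assumption that the 'header' clues are the
tokens spelled with a leading "header:" (header-presence tokens). That prefix,
the example token "header:subject:1", and the name drop_header_clues are
assumptions for illustration only.

```python
def drop_header_clues(inhibited, prefix="header:"):
    """Return the inhibited set without header-presence tokens.

    The "header:" prefix is an assumed spelling for such tokens.
    """
    return {tok for tok in inhibited if not tok.startswith(prefix)}
```

Tokens such as "subject:Zope" or "content-type:text/plain" do not start with
the assumed prefix, so they stay in the inhibited set.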

filename:      normal8     common7
ham:spam:   20972:4500  20971:4501
fp total:            1           1
fp %:             0.00        0.00
fn total:           88          50
fn %:             1.96        1.11
unsure t:          309         262
unsure %:         1.21        1.03
real cost:     $159.80     $112.40
best cost:     $132.40     $107.80
h mean:           0.16        0.20
h sdev:           2.07        2.46
s mean:          93.59       95.25
s sdev:          17.91       14.65
mean diff:       93.43       95.05
k:                4.68        5.56
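
The "real cost" row is consistent with charging $10 per fp, $1 per fn and
$0.20 per unsure. Those weights are inferred from the numbers in the table
rather than stated in this message, and real_cost is a hypothetical helper.

```python
def real_cost(fp, fn, unsure, fp_w=10.0, fn_w=1.0, unsure_w=0.2):
    """Weighted cost of a run; the default weights are inferred, not quoted."""
    return fp * fp_w + fn * fn_w + unsure * unsure_w


# Counts taken from the fp total, fn total and unsure t rows above.
for name, fp, fn, unsure in [("normal8", 1, 88, 309), ("common7", 1, 50, 262)]:
    print(f"{name}: ${real_cost(fp, fn, unsure):.2f}")
# normal8: $159.80
# common7: $112.40
```

On these weights the $47.40 improvement comes from 38 fewer fn ($38.00) plus
47 fewer unsures ($9.40); the fp count is unchanged.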


html
url:listinfo
url:zope
(related
posts
subject:-
sender:addr:zope.org
encoding!
zope-dev
subject:Zope
url:zope-dev
email name:zope-dev
proto:http
url:org
sender:no real name:2**0
url:mail
subject:dev
url:mailman
content-type:text/plain
maillist
cross
email addr:zope.org
lists
url:zope-announce


-- 
Toby Dickenson


