[spambayes-dev] correlated clues
Toby Dickenson
tdickenson at geminidataloggers.com
Thu Jul 1 08:08:08 EDT 2004
On Thursday 01 July 2004 01:00, Tim Peters wrote:
> We have two anti-bad-correlation gimmicks now, driven by early testing
> results, and rationalized after the fact <wink>:
>
> 1. As mentioned last time, ignoring most header lines. If we didn't,
> virtually all spam on mailing lists would score unsure or FN (thanks
> to a large number of distinct but correlated "I came from a mailing
> list" header tokens).
Thanks for the reminder of this hack. It was the hint I needed to push this
idea into an overall win.
> Maybe another pure but personalized hack would be to add a list of
> specific tokens you want the classifier to pretend didn't exist.
That's exactly where I was digging. I have a small database of list-id (etc.)
headers. If one of those headers is present, the code inserts a list-id token
and inhibits all the tokens from a list-dependent set.
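A minimal sketch of the inhibiting step described above, not the code that
was tested. The table name INHIBITED_BY_LIST, the function filter_tokens,
the "list-id:" token spelling, and the sample list-id value are all
assumptions made for illustration.

```python
# Hypothetical table: list-id value -> tokens to suppress for that list.
# The entries are a small sample of the zope-dev tokens from this post.
INHIBITED_BY_LIST = {
    "zope-dev.zope.org": {"zope-dev", "url:zope", "url:listinfo", "proto:http"},
}


def filter_tokens(list_id, tokens):
    """Drop list-specific tokens and add one list-id token in their place."""
    inhibited = INHIBITED_BY_LIST.get(list_id)
    if inhibited is None:
        # Unknown or absent list-id: leave the token stream untouched.
        return list(tokens)
    kept = [t for t in tokens if t not in inhibited]
    # One synthetic token stands in for the whole correlated set.
    kept.append("list-id:" + list_id)
    return kept
```

The idea is that a classifier sees a single clue for "this came from the
list" instead of many correlated ones.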
I started generating this token set by finding all tokens that are common to
all messages on each list. That was a dramatic loss. Almost every email
contains a 'subject' header, so the presence of that header became a strong
spam clue once a large proportion of my ham had that token inhibited.
Removing all 'header' clues from the set of inhibited tokens makes this an
overall win for me. The final set of inhibited tokens for zope-dev is listed
below. normal8 is without this hack, common7 with it. I will polish this code
enough for repeatable testing, and commit it on a branch tonight.
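One way the inhibited set could be derived, as a sketch under stated
assumptions rather than the code that produced these results: intersect the
token sets of every message seen on a list, then drop the header-presence
clues. The function names and the "header:" prefix used to recognise those
clues are assumptions.

```python
def common_tokens(messages):
    """Tokens present in every message; each message is an iterable of tokens."""
    sets = [set(m) for m in messages]
    if not sets:
        return set()
    return set.intersection(*sets)


def inhibited_set(messages, header_prefix="header:"):
    """Common tokens minus header-presence clues (the prefix is an assumption)."""
    return {t for t in common_tokens(messages)
            if not t.startswith(header_prefix)}
```

Keeping the header-presence tokens out of the set means they are still seen
on list ham, so they do not drift toward being spam clues.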
filename:        normal8     common7
ham:spam:     20972:4500  20971:4501
fp total:              1           1
fp %:               0.00        0.00
fn total:             88          50
fn %:               1.96        1.11
unsure t:            309         262
unsure %:           1.21        1.03
real cost:       $159.80     $112.40
best cost:       $132.40     $107.80
h mean:             0.16        0.20
h sdev:             2.07        2.46
s mean:            93.59       95.25
s sdev:            17.91       14.65
mean diff:         93.43       95.05
k:                  4.68        5.56
html
url:listinfo
url:zope
(related
posts
subject:-
sender:addr:zope.org
encoding!
zope-dev
subject:Zope
url:zope-dev
email name:zope-dev
proto:http
url:org
sender:no real name:2**0
url:mail
subject:dev
url:mailman
content-type:text/plain
maillist
cross
email addr:zope.org
lists
url:zope-announce
--
Toby Dickenson