[Spambayes] Just for fun
Moore, Paul
Paul.Moore@atosorigin.com
Mon Nov 18 10:00:48 2002
From: Tim Peters [mailto:tim.one@comcast.net]
> Good! On my tiny still-hapax-driven purely-mistake-based at-home
> classifier (which is up 79 each of ham and spam trained on) it
> fared much worse:
Mine got some interesting results... The DB is trained on 366 ham and
496 spam. That came mostly from a week or so of collected spam plus
the contents of my Inbox, followed by training on mistakes (not
many). My Inbox is, in some senses, a *lousy* source of ham, as it's
mainly stuff I couldn't find a better home for. So it is 99% internal
mail (i.e., from Exchange rather than Internet mail) and is probably
a spammier-than-average slice of my ham. But if I train on all my ham
(across multiple folders) I get a massive ham:spam imbalance. (When I
next get round to a CVS update, I'll try Tim's new tweak to compensate
for imbalances.)
I'm not good at interpreting this stuff yet, but it came out as
solidly unsure, with some interesting features. The 'sender:no real
name:2**0' token showing up as a solid ham clue is almost certainly
due to Exchange (basically because Exchange doesn't do real headers,
I expect). Most Internet headers show up as good spam clues for me,
which is mildly worrying, although it hasn't caused any real issues
yet.
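For anyone else trying to read these listings: the overall score in the clue list below can be reproduced from its two special '*H*' and '*S*' entries. This is a minimal sketch, assuming the chi-squared combining scheme in which the score is (S - H + 1) / 2. The function names and the 0.20 / 0.90 cutoffs are illustrative assumptions, not values taken from this message.

```python
def combined_score(h, s):
    """Combine the ham indicator H and the spam indicator S into one score.

    Assumes the chi-squared combining rule: score = (S - H + 1) / 2.
    """
    return (s - h + 1.0) / 2.0


def verdict(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Classify a score as ham, spam, or unsure.

    The cutoff defaults are assumptions for illustration only.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# '*H*' and '*S*' values as reported in the clue list
score = combined_score(h=0.998703, s=0.698066)
print(score, verdict(score))
```

With H and S both fairly high, the two indicators largely cancel out, which is why the message lands in the middle of the range rather than near 0 or 1.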
The obvious implication is that getting a really good training corpus
is *hard*. Probably beyond the means of the average user. But as a
lousy corpus still gives good results, it's hard to decide whether or
not to care.
Here are the clues.
Spam Score: 0.349681
word spamprob #ham #spam
'*H*' 0.998703 - -
'*S*' 0.698066 - -
'sender:no real name:2**0' 0.00884086 25 0
'subject:[' 0.0155709 14 0
'url:mailman' 0.0167286 13 0
'url:listinfo' 0.0180723 12 0
'specific' 0.0196507 11 0
'is.' 0.0238095 9 0
'url:python' 0.0266272 8 0
'to:addr:python.org' 0.0412844 5 0
'them,' 0.0505618 4 0
'sender:addr:python.org' 0.0505618 4 0
'problem' 0.0521891 44 3
'url:org' 0.0567176 28 2
'know,' 0.0652174 3 0
'email addr:python.org' 0.0652174 3 0
'skip:_ 40' 0.0652174 3 0
'delivery' 0.0676112 13 1
'updated' 0.0676112 13 1
"can't" 0.0683657 43 4
'running' 0.0727202 12 1
'set' 0.0789344 54 6
'date' 0.0912609 24 3
'mission' 0.0918367 2 0
'sorted' 0.0918367 2 0
'host' 0.104237 8 1
'base' 0.116911 7 1
'various' 0.121676 12 2
'using' 0.125907 73 14
'content-type:text/plain' 0.128685 326 65
'however' 0.133102 6 1
'back' 0.145642 40 9
'ask' 0.149462 22 5
'site.' 0.154513 5 1
'solve' 0.154513 5 1
'contains' 0.154992 9 2
'net.' 0.155172 1 0
'url:spambayes' 0.155172 1 0
'sender:addr:spambayes-bounces' 0.155172 1 0
'spambayes' 0.155172 1 0
'weekly.' 0.155172 1 0
'subject:email' 0.155172 1 0
'shut' 0.155172 1 0
'second.' 0.155172 1 0
'policies' 0.155172 1 0
'parameters' 0.155172 1 0
'emails.' 0.155172 1 0
'email name:spambayes' 0.155172 1 0
'duplicate' 0.155172 1 0
'together' 0.170569 8 2
'current' 0.175793 25 7
'closed' 0.184169 4 1
'paying' 0.184169 4 1
'data' 0.184776 17 5
'there' 0.184986 95 29
'meet' 0.189638 7 2
'close' 0.189638 7 2
'need' 0.190325 98 31
'site' 0.190699 32 10
'being' 0.192172 41 13
'directly' 0.204069 15 5
'they' 0.206035 66 23
'may' 0.206235 100 35
'use' 0.208203 96 34
'been' 0.223143 93 36
'have' 0.229481 221 89
'header:Received:9' 0.238618 24 10
'just' 0.251571 97 44
'like' 0.253328 70 32
'product' 0.261037 15 7
'not' 0.263917 200 97
'can' 0.263933 165 80
'reply-to:none' 0.266838 343 169
'noheader:reply-to' 0.266838 343 169
'will' 0.268784 175 87
'only' 0.27345 67 34
'that' 0.284503 223 120
'come' 0.28675 26 14
'for' 0.287229 253 138
'against' 0.292477 11 6
'down' 0.294767 25 14
'once' 0.294767 25 14
'new' 0.299542 90 52
'campaign' 0.299577 2 1
'reliable' 0.299577 2 1
'find' 0.304047 56 33
'already' 0.30613 22 13
'service' 0.308224 40 24
'well' 0.308669 30 18
'see' 0.308851 63 38
'way' 0.310625 33 20
'many' 0.311285 28 17
"don't" 0.317645 89 56
'again.' 0.683667 4 12
'subject:.' 0.697304 22 69
'card' 0.698247 5 16
'low' 0.70042 4 13
'totally' 0.703898 3 10
'header:Errors-To:1' 0.718457 21 73
'header:Date:1' 0.720317 142 496
'header:From:1' 0.720317 142 496
'us,' 0.72912 4 15
'header:Return-Path:1' 0.737732 130 496
'to:2**0' 0.74404 123 485
'proto:http' 0.772652 75 346
'to:no real name:2**0' 0.775127 90 421
'net' 0.776394 2 10
'price' 0.776394 2 10
'sites' 0.776394 2 10
'success.' 0.796678 1 6
'visit' 0.797988 15 81
'url:www' 0.805641 49 276
'matter' 0.81794 2 13
'url:com' 0.818306 50 306
'effective' 0.819813 1 7
'marketing' 0.831585 3 21
'companies' 0.838229 1 8
'price.' 0.844828 0 1
'subject:Bullet' 0.844828 0 1
'time!' 0.844828 0 1
'relax' 0.844828 0 1
'proof' 0.844828 0 1
'from:addr:concentric.net' 0.844828 0 1
'friendly' 0.844828 0 1
'campaigns.' 0.844828 0 1
'bullet' 0.844828 0 1
'beautiful' 0.844828 0 1
'$200' 0.844828 0 1
'credit' 0.871816 3 29
'emails' 0.875534 3 30
'income' 0.878287 2 21
'offer' 0.890749 7 79
'merchant' 0.908163 0 2
'complaints' 0.908163 0 2
'cheap' 0.908163 0 2
'header:Mime-Version:1' 0.923952 16 266
'url:mail' 0.935447 3 62
'adult' 0.958716 0 5
'lowest' 0.958716 0 5
'advertise' 0.969799 0 7
'prices' 0.973373 0 8
'gambling' 0.973373 0 8
'$500' 0.97619 0 9
'hundreds' 0.980349 0 11
'dollars' 0.983271 0 13
'guarantee' 0.983271 0 13
'thousands' 0.984429 0 14
'million' 0.990405 0 23
'bulk' 0.990405 0 23
'advertising' 0.990405 0 23
'advertised' 0.995627 0 51
'websites' 0.995942 0 55
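The spamprob column above can be recomputed from the #ham and #spam counts. This is a sketch that assumes a Robinson-style adjustment, with an unknown-word strength of 0.45 and an unknown-word probability of 0.5; those two values are assumptions, chosen because they reproduce the figures listed. The corpus sizes are the 366 ham and 496 spam mentioned above. The function name `spamprob` is only for illustration.

```python
def spamprob(ham_count, spam_count, nham=366, nspam=496, s=0.45, x=0.5):
    """Estimate a token's spam probability from its training counts.

    nham / nspam: number of ham and spam messages trained on.
    s, x: assumed prior strength and prior probability for unseen words.
    """
    n = ham_count + spam_count
    if n == 0:
        # Never-seen token: fall back to the prior.
        return x
    # Normalise by corpus size so a ham:spam imbalance is accounted for.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    p = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate towards x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


for word, ham, spam in [
    ("sender:no real name:2**0", 25, 0),
    ("problem", 44, 3),
    ("have", 221, 89),
    ("price.", 0, 1),
    ("websites", 0, 55),
]:
    print(f"{word!r} {spamprob(ham, spam):.6g} {ham} {spam}")
```

This also shows why a token seen in just one spam ('price.', 0.844828) is a much weaker clue than one seen in 55 ('websites', 0.995942): with few sightings the prior of 0.5 still carries noticeable weight.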
Paul.