[Spambayes] Just for fun

Moore, Paul Paul.Moore@atosorigin.com
Mon Nov 18 10:00:48 2002


From: Tim Peters [mailto:tim.one@comcast.net]
> Good!  On my tiny still-hapax-driven purely-mistake-based at-home
> classifier (which is up to 79 each of ham and spam trained on) it
> fared much worse:

Mine got some interesting results... The DB is trained on 366 good,
496 spam, which came mostly from collected spam over a week or so,
plus the contents of my Inbox, and then training on mistakes (not
many). My Inbox is, in some senses, a *lousy* source of ham, as it's
mainly stuff I couldn't find a better home for. So it is 99% internal
mail (i.e., from Exchange rather than Internet mail) and is probably
a spammier-than-average slice of my ham. But if I train on all my ham
(across multiple folders) I get a massive ham:spam imbalance. (When I
next get to do a CVS update, I'll try Tim's new tweak to compensate
for imbalances.)

I'm not good at interpreting this stuff yet, but it came out as
solidly unsure, with some interesting features. The 'sender:no real
name:2**0' token as a solid ham clue is almost certainly due to
Exchange (I expect because Exchange doesn't do real headers). I see
most internet headers as good spam clues, which is mildly worrying,
although it hasn't caused any real issues yet.

The obvious implication is that getting a really good training corpus
is *hard*. Probably beyond the means of the average user. But as a
lousy corpus still gives good results, it's hard to decide whether or
not to care.

Here are the clues.

Spam Score: 0.349681
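
For readers wondering how the score relates to the '*H*' and '*S*' rows
in the table, here is a minimal sketch. It assumes the chi-squared
combining rule score = (S - H + 1) / 2, which is consistent with the
numbers in this message. The function names and the 0.20 / 0.90 cutoffs
are assumptions for illustration, not something stated in this post.

```python
def combine(h, s):
    """Combine the ham evidence H and spam evidence S into one score.

    Assumed rule: (S - H + 1) / 2, so strong evidence on both sides
    pulls the score toward the middle instead of to either extreme.
    """
    return (s - h + 1.0) / 2.0


def verdict(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a score to ham/unsure/spam. The cutoffs are assumed values."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# Values taken from the '*H*' and '*S*' rows of the clue table.
score = combine(0.998703, 0.698066)
print(round(score, 4), verdict(score))
```

With H very high and S moderately high, the score lands well inside the
middle band, which fits the "solidly unsure" reading above.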


word                                spamprob         #ham  #spam
'*H*'                               0.998703            -      -
'*S*'                               0.698066            -      -
'sender:no real name:2**0'          0.00884086         25      0
'subject:['                         0.0155709          14      0
'url:mailman'                       0.0167286          13      0
'url:listinfo'                      0.0180723          12      0
'specific'                          0.0196507          11      0
'is.'                               0.0238095           9      0
'url:python'                        0.0266272           8      0
'to:addr:python.org'                0.0412844           5      0
'them,'                             0.0505618           4      0
'sender:addr:python.org'            0.0505618           4      0
'problem'                           0.0521891          44      3
'url:org'                           0.0567176          28      2
'know,'                             0.0652174           3      0
'email addr:python.org'             0.0652174           3      0
'skip:_ 40'                         0.0652174           3      0
'delivery'                          0.0676112          13      1
'updated'                           0.0676112          13      1
"can't"                             0.0683657          43      4
'running'                           0.0727202          12      1
'set'                               0.0789344          54      6
'date'                              0.0912609          24      3
'mission'                           0.0918367           2      0
'sorted'                            0.0918367           2      0
'host'                              0.104237            8      1
'base'                              0.116911            7      1
'various'                           0.121676           12      2
'using'                             0.125907           73     14
'content-type:text/plain'           0.128685          326     65
'however'                           0.133102            6      1
'back'                              0.145642           40      9
'ask'                               0.149462           22      5
'site.'                             0.154513            5      1
'solve'                             0.154513            5      1
'contains'                          0.154992            9      2
'net.'                              0.155172            1      0
'url:spambayes'                     0.155172            1      0
'sender:addr:spambayes-bounces'     0.155172            1      0
'spambayes'                         0.155172            1      0
'weekly.'                           0.155172            1      0
'subject:email'                     0.155172            1      0
'shut'                              0.155172            1      0
'second.'                           0.155172            1      0
'policies'                          0.155172            1      0
'parameters'                        0.155172            1      0
'emails.'                           0.155172            1      0
'email name:spambayes'              0.155172            1      0
'duplicate'                         0.155172            1      0
'together'                          0.170569            8      2
'current'                           0.175793           25      7
'closed'                            0.184169            4      1
'paying'                            0.184169            4      1
'data'                              0.184776           17      5
'there'                             0.184986           95     29
'meet'                              0.189638            7      2
'close'                             0.189638            7      2
'need'                              0.190325           98     31
'site'                              0.190699           32     10
'being'                             0.192172           41     13
'directly'                          0.204069           15      5
'they'                              0.206035           66     23
'may'                               0.206235          100     35
'use'                               0.208203           96     34
'been'                              0.223143           93     36
'have'                              0.229481          221     89
'header:Received:9'                 0.238618           24     10
'just'                              0.251571           97     44
'like'                              0.253328           70     32
'product'                           0.261037           15      7
'not'                               0.263917          200     97
'can'                               0.263933          165     80
'reply-to:none'                     0.266838          343    169
'noheader:reply-to'                 0.266838          343    169
'will'                              0.268784          175     87
'only'                              0.27345            67     34
'that'                              0.284503          223    120
'come'                              0.28675            26     14
'for'                               0.287229          253    138
'against'                           0.292477           11      6
'down'                              0.294767           25     14
'once'                              0.294767           25     14
'new'                               0.299542           90     52
'campaign'                          0.299577            2      1
'reliable'                          0.299577            2      1
'find'                              0.304047           56     33
'already'                           0.30613            22     13
'service'                           0.308224           40     24
'well'                              0.308669           30     18
'see'                               0.308851           63     38
'way'                               0.310625           33     20
'many'                              0.311285           28     17
"don't"                             0.317645           89     56
'again.'                            0.683667            4     12
'subject:.'                         0.697304           22     69
'card'                              0.698247            5     16
'low'                               0.70042             4     13
'totally'                           0.703898            3     10
'header:Errors-To:1'                0.718457           21     73
'header:Date:1'                     0.720317          142    496
'header:From:1'                     0.720317          142    496
'us,'                               0.72912             4     15
'header:Return-Path:1'              0.737732          130    496
'to:2**0'                           0.74404           123    485
'proto:http'                        0.772652           75    346
'to:no real name:2**0'              0.775127           90    421
'net'                               0.776394            2     10
'price'                             0.776394            2     10
'sites'                             0.776394            2     10
'success.'                          0.796678            1      6
'visit'                             0.797988           15     81
'url:www'                           0.805641           49    276
'matter'                            0.81794             2     13
'url:com'                           0.818306           50    306
'effective'                         0.819813            1      7
'marketing'                         0.831585            3     21
'companies'                         0.838229            1      8
'price.'                            0.844828            0      1
'subject:Bullet'                    0.844828            0      1
'time!'                             0.844828            0      1
'relax'                             0.844828            0      1
'proof'                             0.844828            0      1
'from:addr:concentric.net'          0.844828            0      1
'friendly'                          0.844828            0      1
'campaigns.'                        0.844828            0      1
'bullet'                            0.844828            0      1
'beautiful'                         0.844828            0      1
'$200'                              0.844828            0      1
'credit'                            0.871816            3     29
'emails'                            0.875534            3     30
'income'                            0.878287            2     21
'offer'                             0.890749            7     79
'merchant'                          0.908163            0      2
'complaints'                        0.908163            0      2
'cheap'                             0.908163            0      2
'header:Mime-Version:1'             0.923952           16    266
'url:mail'                          0.935447            3     62
'adult'                             0.958716            0      5
'lowest'                            0.958716            0      5
'advertise'                         0.969799            0      7
'prices'                            0.973373            0      8
'gambling'                          0.973373            0      8
'$500'                              0.97619             0      9
'hundreds'                          0.980349            0     11
'dollars'                           0.983271            0     13
'guarantee'                         0.983271            0     13
'thousands'                         0.984429            0     14
'million'                           0.990405            0     23
'bulk'                              0.990405            0     23
'advertising'                       0.990405            0     23
'advertised'                        0.995627            0     51
'websites'                          0.995942            0     55
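
The spamprob column can be reproduced from the #ham and #spam counts
and the training totals given earlier (366 good, 496 spam). The sketch
below assumes a Robinson-style adjusted probability with a prior of 0.5
and a prior strength of 0.45. Those two constants and the function name
are assumptions chosen because they match the table, not values stated
in this post.

```python
def spamprob(ham_count, spam_count, nham=366, nspam=496,
             strength=0.45, prior=0.5):
    """Estimate a token's spam probability from its training counts.

    Counts are first turned into per-corpus ratios, so a ham:spam
    imbalance in corpus size does not by itself skew the estimate.
    The result is then pulled toward `prior`; the pull is weaker the
    more messages the token has been seen in.
    """
    n = ham_count + spam_count
    if n == 0:
        return prior  # never-seen token: fall back to the prior
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    return (strength * prior + n * raw) / (strength + n)


# A few rows from the table above: (token, #ham, #spam).
for token, ham, spam in [("sender:no real name:2**0", 25, 0),
                         ("problem", 44, 3),
                         ("price.", 0, 1)]:
    print(f"{token!r:30} {spamprob(ham, spam):.6g}")
```

This also shows why a token seen in only one message (such as 'price.'
or 'net.') gets a fairly mild 0.845 or 0.155 rather than 1.0 or 0.0:
with so little evidence the prior still carries real weight.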

Paul.


