[Spambayes] Spam Clues ????? ??????

Mon Apr 28 14:33:29 CEST 2008

On Wed, April 16, 2008 19:34, David wrote:
>
> Am getting loads of spam with cyrillic characters and would like to know
> if Spambayes can automatically delete anything with these characters in
> their headers. Below is score info for typical one. If you need it, could
> send you the config file if you can tell me where to find it.
>
> Kindest regards
> David Kanareck
>
> Combined Score: 57% (0.567348)
>
> Internal ham score (*H*): 0.285187
> Internal spam score (*S*): 0.419882
>
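
The three numbers quoted above fit together in a simple way. The sketch below assumes that the combined score is (S - H + 1) / 2, which reproduces the quoted 57% from the quoted *H* and *S* values; the function name is made up for illustration.

```python
def combined_score(ham_score: float, spam_score: float) -> float:
    """Map the internal *H* and *S* values onto one score between 0 and 1.

    Assumption: combined = (S - H + 1) / 2. This matches the numbers
    quoted in this message.
    """
    return (spam_score - ham_score + 1.0) / 2.0


# Values copied from the quoted clue report.
score = combined_score(ham_score=0.285187, spam_score=0.419882)
print(f"combined score: {score:.4f} (about {score:.0%})")
```

A message with both *H* and *S* low, like this one, lands in the middle: the classifier has not yet seen enough evidence either way.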

> # ham trained on: 39
> # spam trained on: 76

That is not much training. In my experience, Spambayes gets *extremely*
accurate after about 100 hams and 100 spams. Your mileage may vary.
With the Outlook plugin, I add a column that shows the spam score (see
FAQ/wiki for details). I sort on spam score. I look at the bottom and find
one spam with the lowest score. Train as spam. Rescore inbox. Now I look
at the top, and find one ham with the highest score. Train as ham,
rescore. Back to the lowest-scoring spam: train, rescore. Highest-scoring
ham: train, rescore. Lather, rinse, repeat. Very quickly you will see that
all spam scores above 99% and all ham scores below 1%.

This method of training is so kewl that I have actually considered
installing Outlook on Linux, just so that I could train Spambayes this
way.

> 'message.'                          0.310872           15     13
> 'date:'                             0.325631           14     13
> 'checked'                           0.341867           13     13
> 'database:'                         0.341867           13     13
> 'incoming'                          0.341867           13     13
> 'version:'                          0.341867           13     13
> 'virus'                             0.35698            14     15
> 'release'                           0.358294           13     14
> 'avg.'                              0.359817           12     13
> 'skip:2 10'                         0.359817           12     13
> 'found'                             0.385564           14     17

These are generic tokens added by your virus scanner. After more training
they will score around 0.5, which means they will neither increase nor
decrease the overall spam score of a message.
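
To see why these tokens drift toward 0.5, it helps to recompute one. The sketch below assumes the three columns in the quoted report are spam probability, number of hams containing the token, and number of spams containing the token, and that the probability is a ratio of per-class frequencies pulled toward 0.5 by a small prior (strength 0.45, prior 0.5). Those constants and the function name are assumptions; they do reproduce the quoted numbers for 39 hams and 76 spams.

```python
def token_spamprob(ham_count, spam_count, n_ham, n_spam,
                   strength=0.45, unknown_prob=0.5):
    """Estimate how spammy a token is, given training counts.

    ham_count / spam_count: trained messages of each kind containing the token.
    n_ham / n_spam: total trained messages of each kind.
    strength / unknown_prob: assumed prior that pulls rare tokens toward 0.5.
    """
    n = ham_count + spam_count
    if n == 0:
        return unknown_prob  # never seen: no evidence either way
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Blend the raw estimate with the prior, weighted by how often
    # the token has been seen.
    return (strength * unknown_prob + n * raw) / (strength + n)


# 'message.' from the quoted report: seen in 15 hams and 13 spams.
print(round(token_spamprob(15, 13, n_ham=39, n_spam=76), 6))

# A token seen equally often in a balanced training set is neutral.
print(token_spamprob(50, 50, n_ham=100, n_spam=100))
```

With only 39 hams against 76 spams, a token that appears in about as many hams as spams looks hammy (around 0.31 to 0.39), because 13 out of 39 is a much larger share than 13 out of 76. Once the virus scanner footer shows up in a similar share of both classes, the two ratios even out and the result moves toward 0.5.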

> 'to:no real name:2**0'              0.750084           10     59
> 'header:Received:1'                 0.893006            1     18

Interesting tokens...

> 'from:charset:koi8-r'               0.908163            0      2
> 'subjectcharset:koi8-r'             0.908163            0      2

And those last two are *really* interesting tokens!
Keep on training, I can already see that your Spambayes is improving.
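
The two charset tokens reflect the character set declared in the From and Subject headers. As a rough illustration of where that information lives, here is a minimal sketch that reads the declared charsets out of a raw header value with Python's standard email.header.decode_header. It is not Spambayes' own tokenizer code; the function name and the sample subject are invented for the example.

```python
from email.header import decode_header


def header_charsets(raw_header: str) -> set:
    """Return the charsets declared in MIME encoded-words of a header value."""
    charsets = set()
    for _text, charset in decode_header(raw_header):
        # Unencoded parts of the header come back with charset None.
        if charset is not None:
            charsets.add(charset.lower())
    return charsets


# Hypothetical subject line with one koi8-r encoded-word.
sample_subject = "=?koi8-r?B?8NLJ18XU?="
print(header_charsets(sample_subject))
print(header_charsets("Plain ASCII subject"))
```

A hard rule that deletes everything declaring koi8-r would also catch legitimate Russian mail, which is why letting the classifier learn the token from training is usually the safer route.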

-- 
Amedee Van Gasse
amedee at amedee.be

Disclaimer:
By sending an email to ANY of my addresses you are agreeing that:

   1. I am by definition, "the intended recipient"
   2. All information in the email is mine to do with as I see fit and
make such financial profit, political mileage, or good joke as it lends
itself to. In particular, I may quote it on usenet.
   3. I may take the contents as representing the views of your company.
   4. This overrides any disclaimer or statement of confidentiality that
may be included on your message.