[Spambayes] filtering based on headers only?

Andrew Aksyonoff shodan at lipetsk.ru
Wed Aug 13 12:32:05 EDT 2003


Hello.

I'm trying to set up SpamBayes to do training and spam filtering
based on headers only. The reason for this odd requirement is that I
get lots of spam and viruses in my e-mail and don't want to download
it all from my ISP, as we pay for incoming traffic here and it's very
costly. ;(

While examining the source to make the necessary patches, I found
that the block responsible for updating stats and caching messages
is under if command == 'RETR': which effectively prevents training
on headers only.

So the first question is: why is it so, and what should I do?
I can think of two answers:

1) I should keep the distribution untouched, actually download some
   spam, train SpamBayes on it, then disable caching and hope that
   SpamBayes will do good enough detection and insert the proper
   headers while the MUA is doing header retrieval, even though it
   was trained on full messages with bodies.

2) I could comment out that if command == 'RETR' check and then train
   on headers only. That is of course not as good as training on
   bodies too, but it should work.
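
To make option 2 concrete, here is a minimal sketch of a relaxed
check. It is not the SpamBayes code: the function name and the
headers_only flag are hypothetical, and I am assuming the MUA does
its header retrieval with the POP3 TOP command.

```python
# Hypothetical sketch; names are illustrative, not SpamBayes internals.
FULL_RETRIEVAL = "RETR"    # POP3 command that fetches a whole message
HEADER_RETRIEVAL = "TOP"   # POP3 command that fetches headers (plus n body lines)


def should_update_stats_and_cache(command, headers_only=False):
    """Decide whether a proxied POP3 response should be counted and cached.

    With headers_only=False this mirrors the check described above:
    only RETR responses qualify. With headers_only=True, TOP responses
    qualify as well, so header-only retrievals can be cached for training.
    """
    command = command.upper()
    if command == FULL_RETRIEVAL:
        return True
    return headers_only and command == HEADER_RETRIEVAL
```

With headers_only left at its default the behaviour is unchanged, so
a patch along these lines could presumably be made optional rather
than simply deleting the existing check.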

Now, for the second question.
I get the following in the log (the message is spam BTW):

------------------------
+OK Message follows
Return-Path: <root at falcon.lipetsk.ru>
X-Sieve: cmu-sieve 2.0
Received: from pool-141-150-203-85.delv.east.verizon.net ([141.150.203.85]:36617
        "HELO compuserve.com") by falcon.lipetsk.ru with SMTP
        id <S678021AbTHMGxG>; Wed, 13 Aug 2003 10:53:06 +0400
Date:   Wed, 13 Aug 2003 05:55:06 +0000
From:   bert at isis.msstate.edu
Subject: Re:╕Ёюшчтюфё╙тхээvх яыю∙рфш.
To:     Webmaster <webmaster at lipetsk.ru>
References: <DC0DK5869LC3287J at lipetsk.ru>
In-Reply-To: <DC0DK5869LC3287J at lipetsk.ru>
Message-ID: <9D07KEE5KAL75A7I at isis.msstate.edu>
MIME-Version: 1.0
Content-Type: text/html; charset=Windows-1251
Content-Transfer-Encoding: 8bit
X-Spambayes-Exception: exceptions.UnicodeDecodeError('ascii' codec can't
        decode byte 0xcf in position 3: ordinal not in range(128)) in
        append() at C:\Program Files\Python\lib\email\Header.py line
        272: ustr = unicode(s, incodec, errors)
------------------------

and it seems to me that an exception instead of a spam classification
is not what I really want. ;) However, that message (headers only)
gets into the cache OK, without any exceptions.
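
For what it's worth, the failure looks like the plain effect of
decoding raw 8-bit header bytes as ASCII: in Windows-1251 the byte
0xcf is a Cyrillic letter, and it sits right after "Re:" at position
3. Below is a minimal sketch of a more tolerant decode. It is written
for Python 3 rather than the Python 2 in the traceback, the helper
name is hypothetical, and it is not what email/Header.py does.

```python
from typing import Optional


def decode_header_bytes(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode raw header bytes without letting non-ASCII input raise.

    Order of attempts (an assumption of this sketch, not library behaviour):
    1. strict ASCII, which is what a well-formed unencoded header should be;
    2. the charset declared in Content-Type, if one was given and is valid;
    3. ASCII with undecodable bytes replaced by U+FFFD.
    """
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        pass
    if declared_charset:
        try:
            return raw.decode(declared_charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that are invalid in it.
            pass
    return raw.decode("ascii", errors="replace")
```

For example, decode_header_bytes(b"Re:\xcf", "windows-1251") decodes
the offending byte as a Cyrillic letter instead of raising, and
without a charset hint the same byte becomes a replacement character,
so the message could still be tokenized and classified.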

Could you please advise?
Thanks in advance.

- Andrew



