[Spambayes] filtering based on headers only?
Andrew Aksyonoff
shodan at lipetsk.ru
Wed Aug 13 12:32:05 EDT 2003
Hello.
I'm trying to set up SpamBayes to do training and spam filtering
based on headers only. This odd requirement comes from the fact that I get
lots of spam and viruses in my e-mail and don't want to download it all
from my ISP, as we pay for incoming traffic here and it's very costly. ;(
While examining the source to make the necessary patches, I found out
that the block responsible for updating stats and caching messages
is under if command == 'RETR': which effectively prevents training
on headers only.
So the first question is: why is it so, and what should I do?
I can think of two answers:
1) I should keep the distribution untouched, actually download some
spam, train SpamBayes on it, and then disable caching and hope
that SpamBayes will do good enough detection and insert proper
headers while the MUA is doing header retrieval, even though it was
trained on full messages with bodies.
2) Comment out that if command == 'RETR' check and then train on
headers only. That is of course not as good as training on bodies too,
but it should work.
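To make option 2 concrete, here is a minimal Python 3 sketch of what I have in mind. It is not the SpamBayes code: headers_only() and should_update_stats() are hypothetical names, and I am assuming the MUA fetches headers with the POP3 TOP command.

```python
import re


def headers_only(raw: bytes) -> bytes:
    """Return the header block of a raw message, without the body."""
    # Headers end at the first empty line; accept CRLF or bare LF endings.
    return re.split(rb"\r?\n\r?\n", raw, maxsplit=1)[0]


def should_update_stats(command: str, headers_only_mode: bool) -> bool:
    """Hypothetical gate: decide whether a POP3 response is cached/counted.

    RETR always qualifies (the behaviour described above); TOP qualifies
    only when headers-only mode is switched on.
    """
    cmd = command.upper()
    if cmd == "RETR":
        return True
    return headers_only_mode and cmd == "TOP"
```

With a gate like this, full-message behaviour stays unchanged by default, and training on header-only responses becomes an explicit opt-in rather than a commented-out check.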
Now, for the second question.
I get the following in the log (the message is spam BTW):
------------------------
+OK Message follows
Return-Path: <root at falcon.lipetsk.ru>
X-Sieve: cmu-sieve 2.0
Received: from pool-141-150-203-85.delv.east.verizon.net ([141.150.203.85]:36617
"HELO compuserve.com") by falcon.lipetsk.ru with SMTP
id <S678021AbTHMGxG>; Wed, 13 Aug 2003 10:53:06 +0400
Date: Wed, 13 Aug 2003 05:55:06 +0000
From: bert at isis.msstate.edu
Subject: Re:╕Ёюшчтюфё╙тхээvх яыю∙рфш.
To: Webmaster <webmaster at lipetsk.ru>
References: <DC0DK5869LC3287J at lipetsk.ru>
In-Reply-To: <DC0DK5869LC3287J at lipetsk.ru>
Message-ID: <9D07KEE5KAL75A7I at isis.msstate.edu>
MIME-Version: 1.0
Content-Type: text/html; charset=Windows-1251
Content-Transfer-Encoding: 8bit
X-Spambayes-Exception: exceptions.UnicodeDecodeError('ascii' codec can't
decode byte 0xcf in position 3: ordinal not in range(128)) in
append() at C:\Program Files\Python\lib\email\Header.py line
272: ustr = unicode(s, incodec, errors)
------------------------
and it seems to me that an exception instead of a spam classification is
not what I really want. ;) However, that message (headers only) gets into
the cache OK, without any exceptions.
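The error position (byte 0xcf at position 3) would be consistent with a raw 8-bit Subject right after "Re:" being decoded as ASCII. As a rough illustration of the fallback I would expect, here is a minimal Python 3 sketch. decode_raw_header() is a hypothetical helper, not a SpamBayes or email-package function, and using the charset from Content-Type as a guess for the header encoding is my assumption.

```python
def decode_raw_header(value: bytes, declared_charset: str = "ascii") -> str:
    """Decode a raw header value without ever raising on 8-bit bytes."""
    try:
        # Well-formed headers are pure ASCII, so try that first.
        return value.decode("ascii")
    except UnicodeDecodeError:
        pass
    try:
        # Assumption: unencoded 8-bit headers use the body's charset.
        return value.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # Last resort: keep the ASCII part, replace the undecodable bytes.
        return value.decode("ascii", errors="replace")
```

For example, decode_raw_header(b"Re:\xcf", "windows-1251") decodes the 0xcf byte as a Cyrillic letter instead of failing, so the message could still be tokenized and scored.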
Could you please advise?
Thanks in advance.
- Andrew