[Spambayes] Spam Detection Adds Custom Header Entry

Fri Jan 9 00:06:58 EST 2004

[Dennis W. Bulgrien]
> My e-mail server uses a spam filter for the incoming email accounts.
> This filter adds a string into the header, X-Spam-Score, if it finds
> spam and adds an attachment explaining why it considered the email
> to be spam.  Spambayes (Outlook plug-in) apparently ignores this
> [unknown] header entry; or maybe it isn't even given the opportunity
> to parse them.

All versions of SpamBayes ignore "almost all" header lines by default.  The
header lines we look at, and the ways in which we pick them apart, were
determined by testing in the early days.  The ones that were helpful (or at
least not harmful) for everyone were included.

Your X-Spam-Score header would need special tokenization to do any good.
Examples of how to do that can be found here, but you need to change the
SpamBayes source code to try this:

    http://www.entrian.com/sbwiki/SpamCopAndAssassin

Your X-Spam-Score header is probably the same as the more-usual
X-Spam-Status header on that page.
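
For a rough idea of what "special tokenization" could mean here, this is a minimal sketch. It is not the code from the wiki page and not anything in SpamBayes. It assumes the header carries a plain number (e.g. "X-Spam-Score: 7.3"), which I haven't verified for your server, and the function name, token names and bucket width are made up for illustration. The idea is to turn the score into a few coarse tokens rather than one token per distinct value, since a token like "7.3" would rarely be seen twice:

```python
import re
from email import message_from_string


def tokenize_spam_score(msg, header="X-Spam-Score"):
    """Yield one coarse token describing a numeric score header.

    Hypothetical helper: the header format (a bare number somewhere in
    the value) is an assumption, not something SpamBayes defines.
    """
    value = msg.get(header)
    if value is None:
        # The upstream filter adds the header only when it suspects spam,
        # so its absence is itself a clue.
        yield "x-spam-score:none"
        return
    match = re.search(r"-?\d+(?:\.\d+)?", value)
    if match is None:
        yield "x-spam-score:unparsed"
        return
    score = float(match.group())
    # Bucket into steps of 5, clamped to [-5, 20], so that similar scores
    # share a token and the classifier sees each token often enough.
    bucket = max(-5, min(20, int(score // 5) * 5))
    yield "x-spam-score:%d" % bucket


if __name__ == "__main__":
    sample = message_from_string("X-Spam-Score: 7.3\n\nbuy now\n")
    print(list(tokenize_spam_score(sample)))
```

Whether tokens like these help or hurt is exactly the open question; a sketch like this is only a starting point for testing.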

Note that we (the project) don't add anything as a default unless it goes
thru a long-winded multi-person testing process first.  It's not clear
whether tokenizing this kind of header would help or hurt, and it hasn't
been tested.  One reason it might hurt:  many email sources that use
SpamAssassin, and especially those that keep their SA up to date, do a good
job of filtering out spam, but then the X-Spam-Status headers attached to
spams that leak through would effectively give SpamBayes a large pile of bad
clues saying "this is ham".  SpamBayes would then tend to inherit the same
weaknesses as that installation of SA, so the false negative rate could
actually increase if we paid attention to SA's ideas.  Only broad testing
could determine whether it would.  A distinct worry is that spammers will
forge bogus headers of this nature (saying that the message isn't spam).

> I also don't see Spambayes clues that score based on the presence
> of attachments, the name of the attachments, or its contents.
> In the present case, it would be helpful as all attachment names are
> identical.

Alas, the info in Content-Type and Content-Disposition headers is valuable,
and its absence is unique to the Outlook version of SpamBayes.

Short course:  the email parser we use requires standard MIME structure, but
Outlook destroys the MIME structure of incoming messages, spraying bits and
pieces all over creation to fit into a message store that was designed long
before MIME became common.

It's really not clear which parts of the original email we actually get back
from Outlook in all cases, but in any case we squash everything we get into
one piece of plain text so that we can *synthesize* (trivially) correct MIME
structure for our parser to chew on.  Working with Outlook at this level is
full of nasty surprises, so I don't know whether that will ever improve.
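
To make "synthesize (trivially) correct MIME structure" concrete, here is a minimal sketch using Python's standard email package. It is not the msgstore.py code. The function name is hypothetical, and it assumes the headers and the squashed text have already been pulled out of Outlook somehow. Any original MIME headers are dropped because they no longer describe what we actually have, and the result is a single text/plain part:

```python
from email import message_from_string
from email.message import EmailMessage

# Headers describing the original MIME structure, which no longer matches
# the single blob of text being wrapped.
_STALE_MIME_HEADERS = {"content-type", "content-transfer-encoding",
                       "mime-version"}


def synthesize_plain_message(header_pairs, body_text):
    """Wrap recovered headers and text as a one-part text/plain message.

    Hypothetical helper; header_pairs is a list of (name, value) tuples.
    """
    msg = EmailMessage()
    for name, value in header_pairs:
        if name.lower() in _STALE_MIME_HEADERS:
            continue
        msg[name] = value
    # set_content() supplies fresh Content-Type and MIME-Version headers.
    msg.set_content(body_text)
    return msg.as_string()


if __name__ == "__main__":
    text = synthesize_plain_message(
        [("Subject", "hello"),
         ("Content-Type", "multipart/mixed; boundary=xyz")],
        "cheap pills, and whatever else could be recovered\n",
    )
    parsed = message_from_string(text)
    print(parsed.get_content_type(), parsed.is_multipart())
```

The parser is then happy, but note what is lost: attachment names and per-part content types never make it into the result, which is why no clues about them show up.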

If you decide to dedicate your life to reverse-engineering Outlook internals
<wink>, our Outlook2000/msgstore.py's _GetMessageTextParts() is the place to
look.

> May it be known, however, that with these, Spambayes is still
> doing a very good job of catching them.

So who cares <wink>?  SA does a much better job than we do on lots of header
clues, like MSGID_FROM_MTA_SHORT, in part because we don't even try to
detect stuff like that.  It usually doesn't matter to the final outcome, but
"semantic" header clues can make all the difference in a very short spam.