[Spambayes-checkins] spambayes/Outlook2000 addin.py,1.8,1.9 classify.py,1.6,1.7 filter.py,1.7,1.8 manager.py,1.13,1.14 train.py,1.4,1.5

Tim Peters tim.one@comcast.net
Sun, 20 Oct 2002 22:42:39 -0400


[Mark Hammond]
> Modified Files:
> 	addin.py classify.py filter.py manager.py train.py
> Log Message:
> Standardize where messages are pulled apart into a text stream so
> everyone is consistent.  Append *both* the HTML body and the plain
> text body to the stream (some spam has the payload in *both*)

This was a good idea.  The problem is that there's no MIME structure in the
generated string, so the tokenizer doesn't "see" either the HTML body or the
plain text body.  This is a problem for every msg with a MIME multipart
Content-Type (multipart/alternative included) in the original headers, HTML
or not:  the Content-Type header specifies a boundary tag to look for in the
body, but the boundary tag doesn't exist in the body in this reconstituted
string.
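
The boundary problem is easy to reproduce outside Outlook.  What follows is
a minimal sketch, not the Outlook2000 code:  it assumes Python's current
stdlib email package (which may behave differently from the version in use
here), a made-up boundary "XYZ", and a hypothetical helper text_part_types()
standing in for a tokenizer that only looks at text/* parts.

```python
from email import message_from_string

# Headers as they might survive from the original message (illustrative).
HEADERS = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
)

# Body as it might look after gluing the HTML and plain text bodies
# together: no "--XYZ" boundary lines anywhere.
FLAT_BODY = "<html><body>naked</body></html>\nnaked\n"

# The same content with the MIME structure the header promises.
MIME_BODY = (
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "naked\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body>naked</body></html>\n"
    "--XYZ--\n"
)


def text_part_types(raw):
    """Return the content types of all text/* parts found in raw."""
    msg = message_from_string(raw)
    return [
        part.get_content_type()
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    ]


flat_types = text_part_types(HEADERS + FLAT_BODY)
mime_types = text_part_types(HEADERS + MIME_BODY)
print(flat_types)
print(mime_types)
```

With the flattened body the parser never finds a "--XYZ" line, so the msg
has no subparts and anything walking the parts gets no text from the body;
with real boundaries it finds both the text/plain and the text/html part.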

I started to suspect something fishy when I saw that "naked" in a ham msg
had a neutral spamprob.  It's because the tokenizer has rarely *seen* the
"naked"s in the 100s of porn spams I trained on -- it doesn't see anything
from the body in most of them.  The suspicion intensified when some
screamingly obvious spam showed up in my Unsure folder.

The good news is that almost all of my spam is getting caught anyway,
even though most spam is getting judged solely by the tiny subset of header
lines we don't ignore by default!  In effect, it's rediscovered my one
previous by-hand spam rule:  "If it came from my MSN account, it's spam".
But it's a hell of a lot better than that was, even running nearly blind.

Unclear to me what to do about this; Outlook doesn't make life easy here.