[spambayes-dev] default to mine_received_headers=True, "may be forged"

Tim Peters tim.one at comcast.net
Sun Dec 21 01:21:25 EST 2003


Good news and bad news on mine_received_headers in my classifier now.

The good news is that it *generally* made ham hammier and spam spammier.

The bad news is that spam leaking thru python.org mailing lists is now much
more likely to score as ham, where before it scored as unsure, due to the
large number of new python.org-related clues.  The lowest-scoring spam in my
training data now is this:

"""
...
Subject: HOT OPPORTUNITY
...

    JUST CHECK OUT MY WEBSITE

http://www.webspawner.com/users/hawkk/index.html
--
http://mail.python.org/mailman/listinfo/python-list
"""

It turns out I've actually trained on two copies of that one, but despite
that it's scoring only 19 now:

Combined Score: 19% (0.193802)
Internal ham score (*H*): 0.85051
Internal spam score (*S*): 0.238114
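
For anyone wondering how the three numbers above relate: under chi-squared
combining, as I understand it, the final score is just the two internal
scores folded into one value between 0 and 1.  A minimal sketch (the function
name is made up for illustration and isn't the classifier's own code):

```python
def combined_score(ham_score, spam_score):
    """Fold the internal ham (*H*) and spam (*S*) scores into one value.

    0.0 means confidently ham, 1.0 confidently spam, and 0.5 means the
    two internal scores cancel each other out.
    """
    return (spam_score - ham_score + 1.0) / 2.0


# The *H* and *S* values reported for the HOT OPPORTUNITY spam.
score = combined_score(0.85051, 0.238114)
print(f"{score:.6f}")  # 0.193802
```

So a strong *H* drags the result toward 0 even when *S* isn't tiny.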

These are all the "ah, this came from a python.org mailing list" features
now, more than double the number there were before:

'url:mailman'                       0.128016
'url:listinfo'                      0.130533
'url:python'                        0.135874
'bi:proto:http url:mail'            0.138712
'url:python-list'                   0.145499
'received:127'                      0.146801
'received:127.0'                    0.146801
'received:127.0.0'                  0.146801
'received:127.0.0.1'                0.146801
'bi:received:12.155.117.29 received:localdomain' 0.1549
'received:localhost.localdomain'    0.16481
'sender:addr:python-list-bounces+tim.one=comcast.net' 0.168566
'sender:addr:python.org'            0.168824
'received:12'                       0.211812
'bi:to:addr:python.org to:no real name:2**0' 0.213042
'received:12.155'                   0.214529
'received:12.155.117'               0.214529
'received:mail.python.org'          0.214529
'received:python.org'               0.214529
'url:org'                           0.221085

So it's got 11(!) new correlated clues extracted from two Received headers:

Received: from mail.python.org ([12.155.117.29])
	by sccrmxc14.comcast.net (sccrmxc14) with ESMTP
	id <20031211091604s14001ch25e>; Thu, 11 Dec 2003 09:16:04 +0000
X-Originating-IP: [12.155.117.29]
Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org)
	by mail.python.org with esmtp (Exim 4.22) id 1AUMvU-0000lU-7N
	for tim.one at comcast.net; Thu, 11 Dec 2003 04:16:04 -0500
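
To see why two headers produce so many clues, here's a rough sketch of the
kind of breakdown involved: each host name yields a token for every trailing
piece of the name, and each IP address a token for every leading piece.  The
regexps and function names below are my own guesses for illustration, not the
tokenizer's actual code, and the bigram clues are left out:

```python
import re

# Hypothetical patterns: the "from <host>" name and a bracketed or
# parenthesized dotted-quad address.
HOST_RE = re.compile(r"from\s+([a-z0-9._-]+[a-z])[\s)]", re.IGNORECASE)
IP_RE = re.compile(r"[\[(](\d{1,3}(?:\.\d{1,3}){3})[\])]")


def host_suffixes(host):
    """'mail.python.org' -> ['org', 'python.org', 'mail.python.org']"""
    parts = host.lower().split(".")
    return [".".join(parts[-i:]) for i in range(1, len(parts) + 1)]


def ip_prefixes(ip):
    """'127.0.0.1' -> ['127', '127.0', '127.0.0', '127.0.0.1']"""
    parts = ip.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def received_tokens(header_value):
    """Return 'received:...' tokens for one Received header value."""
    tokens = []
    for match in HOST_RE.finditer(header_value):
        tokens += ["received:" + s for s in host_suffixes(match.group(1))]
    for match in IP_RE.finditer(header_value):
        tokens += ["received:" + p for p in ip_prefixes(match.group(1))]
    return tokens


headers = [
    "from mail.python.org ([12.155.117.29])\n"
    "\tby sccrmxc14.comcast.net (sccrmxc14) with ESMTP",
    "from localhost.localdomain ([127.0.0.1] helo=mail.python.org)\n"
    "\tby mail.python.org with esmtp (Exim 4.22) id 1AUMvU-0000lU-7N",
]
for value in headers:
    print(received_tokens(value))
```

Every one of those tokens is highly correlated with the others, since they
all say the same thing: the message passed thru mail.python.org.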

If I were doing train-on-everything instead of just mistakes, I'm afraid the
spamprobs on the python.org clues would approach 0 (I get a couple hundred
ham from mail.python.org every day, but typically no spam from there) --
then we'd be close to "spectacular failure" territory, for such very short
spam.
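
A back-of-the-envelope sketch of that worry, assuming the usual
Robinson-style adjustment of a word's raw spam ratio toward an unknown-word
prior.  The strength 0.45 and prior 0.5 are what I believe the defaults to
be, and the function name and message counts are invented for illustration:

```python
UNKNOWN_WORD_STRENGTH = 0.45  # assumed default weight given to the prior
UNKNOWN_WORD_PROB = 0.5       # assumed default prior for an unseen word


def spamprob(spamcount, hamcount, nspam, nham):
    """Estimate a token's spam probability from training counts.

    spamcount/hamcount: trained spam/ham messages containing the token.
    nspam/nham: total trained spam/ham messages (both must be > 0).
    """
    n = spamcount + hamcount
    if n == 0:
        return UNKNOWN_WORD_PROB
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    raw = spamratio / (spamratio + hamratio)
    # Shrink the raw ratio toward the prior; the more evidence (n),
    # the less the prior matters.
    s, x = UNKNOWN_WORD_STRENGTH, UNKNOWN_WORD_PROB
    return (s * x + n * raw) / (s + n)


# A python.org clue after one day, one week, and one month of training
# on roughly 200 list ham a day and no list spam (made-up totals).
for days in (1, 7, 30):
    ham_seen = 200 * days
    print(days, spamprob(0, ham_seen, nspam=1000, nham=1000 + ham_seen))
```

With no spam ever seen carrying the clue, the only thing holding the
spamprob above 0 is the prior, and that washes out quickly as the ham piles
up.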

Something to be aware of, anyway!

On the other side, all the ham in my training data scores 0 now (rounded to
two digits), which I've never seen before.  That's remarkable since the only
ham in there came from mistakes and unsures (50 left over from my unigram
classifier, about 100 added since then).  Only 5 training spam don't score
100 (rounded), which are exactly the 5 training spam that came from a
python.org mailing list.  Overall, that's also better than I've seen before,
although the bit of python.org spam is doing worse than I've seen before
(for the obvious reason explained above).



