[Spambayes] better Received header tokens
Tim Peters
tim.one at comcast.net
Mon Mar 10 21:28:28 EST 2003
[Tim]
> It was an example of harmful correlation, by way of illustrating
> why a strong indicator isn't necessarily a desirable indicator.
> This particular example applies pretty directly to any
> source from which a user rarely (but not never) gets spam, and
> that leaves clues about itself.
[Skip Montanaro]
> True enough. I'm sure there are lots of such correlations. But if a
> person's incoming mail isn't dominated by one source, such harmful
> correlations will have less impact on the final score of any
> given message, right?
Strictly less, yes, but it's a second-order distinction and would have
trouble being *significantly* less. Say you have H total ham and S total
spam, and that a particular token appears in h ham and s spam. The
unadjusted spamprob for that token is then
           s/S
        ---------
        s/S + h/H
which can be rearranged as
            H
        ----------
        H + (h/s)S
The magnitudes of h and s don't matter to the result, nor even the
magnitudes of h and s relative to H and S -- all that matters is the ratio
of h to s. So it makes no difference at this level whether the token
appears in 99% of your training data, or in 0.0001% of it: if it appears in
(say) 20 times more ham msgs than spam msgs, the first-order spamprob guess
is the same whether that's a total of 20 msgs or 20 million. Or, IOW, if 1%
of my python.org mail is spam, and 1% of my guysnamedtim.com mail is spam,
and 1% of my friendsofskip.org mail is spam, a clue unique to any of those
sources gets the same first-order spamprob, and regardless of what
percentages of my total email derive from these sources.
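A minimal Python sketch of the point above, not the actual Spambayes code: the function names are hypothetical, and the corpus sizes are only loosely modeled on the ham and spam counts quoted later in this message. It computes the unadjusted spamprob both ways and checks that only the ratio of h to s matters.

```python
import math


def unadjusted_spamprob(h, s, H, S):
    """First-order guess: (s/S) / (s/S + h/H)."""
    spam_ratio = s / S
    ham_ratio = h / H
    return spam_ratio / (spam_ratio + ham_ratio)


def rearranged_spamprob(h, s, H, S):
    """Same quantity written as H / (H + (h/s)*S); requires s > 0."""
    return H / (H + (h / s) * S)


# Illustrative corpus sizes (assumed for the example).
H, S = 12000, 7000

# A token seen in 20 times more ham than spam, at two very different scales.
rare = unadjusted_spamprob(h=20, s=1, H=H, S=S)
common = unadjusted_spamprob(h=2000, s=100, H=H, S=S)

# Both forms agree, and the scale of h and s drops out.
same_across_scales = math.isclose(rare, common)
forms_agree = math.isclose(rare, rearranged_spamprob(20, 1, H, S))
```

Under these assumptions both checks come out true: the token gets the same first-order spamprob whether it shows up in 21 msgs or 2100.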
The Bayesian adjustment goes on to fiddle the guess, taking *some* measure
of the magnitude of h+s into account, but as h+s increases it has a smaller
and smaller effect. If I only have one msg total from guysnamedtim.com, the
adjustment is large, but unknown_word_strength is under 0.5 by default and
we approach the by-counting spamprob guess quickly as h+s increases.
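To see how quickly the adjustment fades, here is a hedged sketch of a Robinson-style adjustment of the form (strength*prior + n*p) / (strength + n), with n = h+s. The function names are hypothetical, and the default values 0.45 for the strength and 0.5 for the prior are assumptions chosen to be consistent with "under 0.5" above; check your own options before relying on them.

```python
def unadjusted_spamprob(h, s, H, S):
    """By-counting guess: (s/S) / (s/S + h/H)."""
    spam_ratio = s / S
    ham_ratio = h / H
    return spam_ratio / (spam_ratio + ham_ratio)


def adjusted_spamprob(h, s, H, S, strength=0.45, prior=0.5):
    """Pull the by-counting guess toward `prior`, weighted by `strength`.

    `strength` plays the role of unknown_word_strength; both defaults
    are assumed values for this illustration.
    """
    n = h + s
    p = unadjusted_spamprob(h, s, H, S)
    return (strength * prior + n * p) / (strength + n)


# Illustrative corpus sizes (assumed for the example).
H, S = 12000, 7000

# One msg total from a source: the adjustment is large.
p_single = unadjusted_spamprob(1, 0, H, S)
gap_single = abs(adjusted_spamprob(1, 0, H, S) - p_single)

# Same h:s ratio at two scales: the adjustment shrinks as h+s grows.
p_ratio = unadjusted_spamprob(20, 1, H, S)
gap_small_n = abs(adjusted_spamprob(20, 1, H, S) - p_ratio)
gap_large_n = abs(adjusted_spamprob(2000, 100, H, S) - p_ratio)
```

With a strength below 1, even a couple dozen msgs are enough to swamp the prior, which is why the by-counting guess dominates so quickly.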
> As an example, I just grep'd my ham collection for the
> Sender field, squashed case, sorted and uniq'd, then sorted again. The
> tail end looked like
>
> 150 sender: folkmusic-admin at grassyhill.org
> 221 sender: zope-admin at zope.org
> 255 sender: folk music presenters <folkvenu at lists.psu.edu>
> 450 sender: spambayes-bounces at python.org
> 550 sender: python-checkins-admin at python.org
> 555 sender: owner-6pack at autox.team.net
> 688 sender: python-dev-admin at python.org
> 821 sender: spamassassin-talk-admin at lists.sourceforge.net
> 1387 sender: cedu-admin at manatee.mojam.com
> 3091 sender: python-list-admin at python.org
>
> This is out of 9609 Sender headers (just under 12,000 hams). If
> I remember comments you've made on this topic in the past, I expect
> your Sender: headers to be more strongly dominated by Python-related
> messages than this.
They are, but, as above, that has a minor effect on spamprobs. What's worse
about python.org mail is that there are so *many* tokens unique to it,
and they're (equally) strong ham clues. Of course there are two sides to
the story: while that makes it easy for spam from python.org to rate
unsure, it also virtually guarantees that ham from python.org never rates
unsure.
> Just the presence of a Sender header regardless of where it came from
> seems to be a pretty strong ham clue (something spammers could/do
> exploit?).
> My roughly 7,000 spams only have 759 Sender headers.
Then they're not very consistent in exploiting it <wink>.
> I haven't experimented with adding it to Options.options.address_headers,
> but your comment in tokenizer.py suggests this probably wouldn't be too
> wise.
It's on by default in the Outlook client. It's deadly for research on
mixed-source corpora, but for live email I expect it to help. This wasn't
formally tested, though, and should be. I can testify from experience
that it's not deadly in real-life Outlook use <wink>.