[Spambayes] better Received header tokens

Mon Mar 10 21:28:28 EST 2003

[Tim]
> It was an example of harmful correlation, by way of illustrating
> why a strong indicator isn't necessarily a desirable indicator.
> This particular example applies pretty directly to any
> source from which a user rarely (but not never) gets spam, and
> leaves clues about itself.

[Skip Montanaro]
> True enough.  I'm sure there are lots of such correlations.  But if a
> person's incoming mail isn't dominated by one source, such harmful
> correlations will have less impact on the final score of any
> given message, right?

Strictly less, yes, but it's a second-order distinction and would have
trouble being *significantly* less.  Say you have H total ham and S total
spam, and that a particular token appears in h ham and s spam.  The
unadjusted spamprob for that token is then

   s/S
---------
s/S + h/H

which can be rearranged as

    H
----------
H + (h/s)S

The magnitudes of h and s don't matter to the result, nor even the
magnitudes of h and s relative to H and S -- all that matters is the ratio
of h to s.  So it makes no difference at this level whether the token
appears in 99% of your training data, or in 0.0001% of it:  if it appears in
(say) 20 times more ham msgs than spam msgs, the first-order spamprob guess
is the same whether that's a total of 20 msgs or 20 million. Or, IOW, if 1%
of my python.org mail is spam, and 1% of my guysnamedtim.com mail is spam,
and 1% of my friendsofskip.org mail is spam, a clue unique to any of those
sources gets the same first-order spamprob, regardless of what
percentages of my total email derive from these sources.
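
A quick way to check the claim above is to compute the unadjusted spamprob
both ways for a rare token and a common token that share the same h:s ratio.
This is only a sketch: the function names and the corpus sizes H and S are
made up for illustration and are not taken from the spambayes code.

```python
def unadjusted_spamprob(h, s, H, S):
    """By-counting guess: (s/S) / (s/S + h/H)."""
    spam_ratio = s / S
    ham_ratio = h / H
    return spam_ratio / (spam_ratio + ham_ratio)


def rearranged_spamprob(h, s, H, S):
    """Same quantity written as H / (H + (h/s)*S); needs s > 0."""
    return H / (H + (h / s) * S)


# Hypothetical corpus sizes, chosen only for the illustration.
H, S = 12000, 7000

# Two tokens, both seen in 20 times more ham than spam.
rare = unadjusted_spamprob(h=20, s=1, H=H, S=S)
common = unadjusted_spamprob(h=2000, s=100, H=H, S=S)

print(rare, common, rearranged_spamprob(20, 1, H, S))
```

All three printed values agree (about 0.079 with these made-up totals),
because only h/s, H and S enter the formula.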

The Bayesian adjustment goes on to fiddle the guess, taking *some* measure
of the magnitude of h+s into account, but as h+s increases it has a smaller
and smaller effect.  If I only have one msg total from guysnamedtim.com, the
adjustment is large, but unknown_word_strength is under 0.5 by default and
we approach the by-counting spamprob guess quickly as h+s increases.
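
To make the shrinking effect concrete, here is a minimal sketch of an
adjustment of that shape, a weighted average of a prior guess and the
by-counting guess.  The exact form and the values used are assumptions on my
part: strength=0.45 is assumed (the text above only says it is under 0.5),
prior=0.5 is an assumed neutral guess for an unseen token, and the function
name is hypothetical.

```python
def adjusted_spamprob(h, s, H, S, strength=0.45, prior=0.5):
    """Blend an assumed prior with the by-counting guess.

    strength and prior are assumed values, not confirmed defaults.
    """
    n = h + s
    if n == 0:
        return prior  # no evidence at all: fall back to the prior
    p = (s / S) / (s / S + h / H)  # unadjusted by-counting spamprob
    return (strength * prior + n * p) / (strength + n)


# Hypothetical corpus sizes.
H, S = 12000, 7000

# A clue seen only in ham (by-counting guess is 0.0), with growing counts.
for h in (1, 20, 200):
    print(h, adjusted_spamprob(h=h, s=0, H=H, S=S))
```

With a single ham message the adjusted value sits well away from the
by-counting guess of 0; by 20 messages it is already close to it.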

> As an example, I just grep'd my ham collection for the
> Sender field, squashed case, sorted and uniq'd, then sorted again.  The
> tail end looked like
>
>      150 sender: folkmusic-admin at grassyhill.org
>      221 sender: zope-admin at zope.org
>      255 sender: folk music presenters <folkvenu at lists.psu.edu>
>      450 sender: spambayes-bounces at python.org
>      550 sender: python-checkins-admin at python.org
>      555 sender: owner-6pack at autox.team.net
>      688 sender: python-dev-admin at python.org
>      821 sender: spamassassin-talk-admin at lists.sourceforge.net
>     1387 sender: cedu-admin at manatee.mojam.com
>     3091 sender: python-list-admin at python.org
>
> This is out of 9609 Sender headers (just under 12,000 hams).  If
> I remember comments you've made on this topic in the past, I expect
> your Sender:  headers to be more strongly dominated by Python-related
> messages than this.
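
For anyone who wants to reproduce a tally like the one quoted above without
grep/sort/uniq, a rough Python equivalent might look like the following.  It
is a sketch only: the sample messages and addresses are invented, and
sender_counts is a hypothetical helper, not part of spambayes.

```python
from collections import Counter
from email import message_from_string


def sender_counts(raw_messages):
    """Count case-squashed Sender header values across raw messages."""
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        sender = msg.get("Sender")  # None when the header is absent
        if sender is not None:
            counts[sender.strip().lower()] += 1
    return counts


# Invented sample messages, for illustration only.
sample = [
    "Sender: zope-admin@example.org\nSubject: a\n\nbody\n",
    "Sender: Zope-Admin@example.org\nSubject: b\n\nbody\n",
    "Sender: spambayes-bounces@example.org\nSubject: c\n\nbody\n",
    "Subject: no sender header\n\nbody\n",
]

counts = sender_counts(sample)
# Print in ascending order of count, like `sort | uniq -c | sort -n`.
for sender, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n:7d} sender: {sender}")
```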

They are, but, as above, that has a minor effect on spamprobs.  What's worse
about python.org mail is that there are so *many* tokens unique to it,
and they're (equally) strong ham clues.  Of course there are two sides to
the story:  while that makes it easy for spam from python.org to rate
unsure, it also virtually guarantees that ham from python.org never rates
unsure.

> Just the presence of a Sender header, regardless of where it came from,
> seems to be a pretty strong ham clue (something spammers could/do
> exploit?).  My roughly 7,000 spams only have 759 Sender headers.

Then they're not very consistent in exploiting it <wink>.

> I haven't experimented with adding it to Options.options.address_headers,
> but your comment in tokenizer.py suggests this probably wouldn't be too
> wise.

It's on by default in the Outlook client.  It's deadly for research on
mixed-source corpora, but for live email I expect it to help.  This wasn't
formally tested, though, and should be.  I can testify from experience
that it's not deadly in real-life Outlook use <wink>.