[spambayes-dev] Tricky false positive: US states

Tim Peters tim.one at comcast.net
Sat Oct 4 18:11:45 EDT 2003


[Richie Hindle]
> Here's an interesting false positive: I asked an American colleague a
> question about US state codes, and he emailed me a copy of this page
> from the US Post Office website:
>
>   http://www.usps.com/ncsc/lookups/usps_abbreviations.html
>
> Now that scored as pretty solid spam for me (0.99075) because all the
> state names are slight spam clues - most of my spam comes from the
> USA. Here's a snippet of the X-Spambayes-Evidence header:
>
>   'lock': 0.73; 'louisiana': 0.73; 'marshall': 0.73; 'missouri': 0.73;
>   'mount': 0.73; 'nebraska': 0.73; 'ohio': 0.73; 'parkway': 0.73;
>   'pennsylvania': 0.73; 'plz': 0.73; 'rad': 0.73; 'square': 0.73;
>   'tennessee': 0.73; 'texas': 0.73; 'trl': 0.73; 'valley': 0.73;
>
> and so on.  All those fifty slightly-spammy state names add up to a big
> spam score.
>
> Most of them are hapaxes, but that's not very relevant - it's just a
> result of not having a very big training set (~600 messages).
>
> Not sure whether there's anything we can do about it (or even whether
> we should consider doing anything about it) but I thought it was
> interesting.

Indeed it is.  You can consider spambayes as asking a large number of
consultants (tokens) whether they think your new message is spam.  In fact,
with a little squinting, you can view most learning algorithms that way.
The strength of a spamprob (its distance from the neutral 0.5) is a measure
of how confident "a consultant" is about their judgment.  If one consultant
says "well, it looks spammy to me, but I wouldn't bet my life on it", and
that's all you know, you're probably not willing to bet anything that
they're right (and a single spamprob of 0.73 is indeed in the Unsure range
for most people).  But if 100 consultants all say that same thing, any
learning algorithm (including a real person!) is going to be quite confident
that the odds of them all being wrong are tiny.
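To make that concrete, here's a minimal sketch of how independent spamprobs
combine -- this is the classic naive combination, not spambayes' actual
chi-squared combining, and naive_combine() is a made-up name.  One 0.73
consultant stays in Unsure territory; fifty of them agreeing are
indistinguishable from certainty:

    from math import prod

    def naive_combine(spamprobs):
        """Combine per-token spamprobs, assuming the tokens are independent."""
        s = prod(spamprobs)                   # joint evidence for spam
        h = prod(1.0 - p for p in spamprobs)  # joint evidence for ham
        return s / (s + h)

    print(naive_combine([0.73]))       # 0.73     -- one consultant, still Unsure
    print(naive_combine([0.73] * 10))  # ~0.99995 -- ten agreeing consultants
    print(naive_combine([0.73] * 50))  # ~1.0     -- fifty, as with the state names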

That's what happened here.  The rub is that getting the same judgment from
100 consultants isn't *really* more reliable than getting it from one
consultant unless the consultants are independent -- if they are
independent, very high confidence is fully justified.  In this case, the
consultants are all related, biased in the same direction for a reason.

I doubt there's any sensible way to deal with that short of having the
semantic knowledge we apply to the problem ("well, these aren't just 50
independent words, they're all names of states in America -- so how many
spam and ham have I seen in the past containing lots of American state
names?" -- and knowing *that* would indeed be an excellent consultant).
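If you did want to hand the classifier that semantic knowledge, one purely
hypothetical route (illustrative names only -- nothing in spambayes does
this) is to have the tokenizer collapse the correlated clues into a single
synthetic token, giving the classifier one consultant to learn a spamprob
for instead of fifty:

    US_STATES = {"alabama", "alaska", "arizona", "louisiana", "missouri",
                 "nebraska", "ohio", "pennsylvania", "tennessee", "texas"}
                 # ... and the other forty

    def synthetic_tokens(words):
        """Yield a meta-token when a message mentions many US state names."""
        hits = sum(1 for w in set(words) if w in US_STATES)
        if hits >= 10:
            yield "meta:many-us-state-names"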

It doesn't surprise me that such glitches exist in a mindless statistical
algorithm, but I've been surprised all along that they don't occur more
often.  That consulting a few thousand morons gets the right answer so
often gives me hope for democracy <wink>.

> [ Ah, no, hang on, I *do* have an idea, but it's mostly outside the
>   remit of Spambayes.  Mail that never went outside my organisation
>   shouldn't be marked as spam.  All the Received headers show the
>   mail moving within my organisation.  So I want some kind of plug-in
>   system whereby I can use the Spambayes tokeniser, header analysis
>   and so on to make my own decisions that override the classifier.
>   Once my army of winged monkeys has finished their Python training
>   course I'll get them onto it. ]

Sean True and Mark Hammond have talked about generalizing the spambayes
framework to allow plugging in other kinds of rules.  I don't know whether
it's being pursued now.
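For what it's worth, the internal-mail override Richie describes doesn't
need much machinery.  A rough sketch, assuming the standard library email
package and made-up names (INTERNAL_DOMAIN and is_internal_only() are not a
spambayes API): if every Received header shows the message staying inside
the local domain, skip the classifier and call it ham.

    import email

    INTERNAL_DOMAIN = "example.com"   # assumption: your organisation's domain

    def is_internal_only(raw_message):
        """True if every Received header mentions only the internal domain."""
        msg = email.message_from_string(raw_message)
        received = msg.get_all("Received") or []
        return bool(received) and all(INTERNAL_DOMAIN in hop for hop in received)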



