[spambayes-dev] Tricky false positive: US states

Mon Oct 6 15:50:34 EDT 2003

    Tim> You can consider spambayes as asking a large number of consultants
    Tim> (tokens) whether they think your new message is spam.  In fact,
    Tim> with a little squinting, you can view most learning algorithms that
    Tim> way.  The strength of a spamprob (its distance from the neutral
    Tim> 0.5) is a measure of how confident "a consultant" is about their
    Tim> judgment.  If one consultant says "well, it looks spammy to me, but
    Tim> I wouldn't bet my life on it", and that's all you know, you're
    Tim> probably not willing to bet anything that they're right (and a
    Tim> single spamprob of 0.73 is indeed in the Unsure range for most
    Tim> people).  But if 100 consultants all say that same thing, any
    Tim> learning algorithm (including a real person!) is going to be quite
    Tim> confident that the odds of them all being wrong are tiny.

I like this non-technical explanation a lot.  I think the hand-waving
description on the website should incorporate this notion.
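
To make the "100 consultants" point concrete, here is a minimal sketch of
Fisher-style chi-squared combining. As far as I understand it this is close
to how spambayes combines spamprobs, but treat that as an assumption; the
names chi2q and combine are made up for illustration, and the sketch treats
every token as independent, which is exactly the catch discussed further
down.

```python
import math

def chi2q(x2, df):
    """Chi-squared survival function, closed-form series for even df only."""
    assert df % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1

def combine(spamprobs):
    """Combine per-token spamprobs into one score in [0, 1] (hypothetical helper)."""
    n = len(spamprobs)
    # Evidence for spam: how surprising the (1 - p) values are if the message were ham.
    spam = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in spamprobs), 2 * n)
    # Evidence for ham: how surprising the p values are if the message were spam.
    ham = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in spamprobs), 2 * n)
    return (spam - ham + 1.0) / 2.0

one_consultant = combine([0.73])
hundred_consultants = combine([0.73] * 100)
```

With a single 0.73 the combined score is just 0.73 again; with one hundred
of them it comes out very close to 1, even though no individual clue got
any stronger.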

    Tim> That's what happened here.  The rub is that getting the same
    Tim> judgment from 100 consultants isn't *really* more reliable than
    Tim> getting it from one consultant unless the consultants are
    Tim> independent -- if they are independent, very high confidence is
    Tim> fully justified.  In this case, the consultants are all related,
    Tim> biased in the same direction for a reason.

This might be worth investigating.  Can't we compute the correlation between
two tokens by keeping track of how frequently they appear in the same
message?  If we know "chicago" and "illinois" are very strongly correlated,
we can potentially choose to ignore one or the other.  This could reduce the
size of the database substantially, and also work toward a situation where
we believed more strongly -- with some justification -- that our consultants'
recommendations were accurate; that a politician wasn't paying them off
behind the scenes, figuratively speaking.
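
A rough sketch of what I have in mind, assuming each training message is
available as a list of tokens. The function name is hypothetical, and the
phi coefficient is just one plausible correlation measure for two
present/absent variables, not something spambayes does today:

```python
import math
from collections import Counter
from itertools import combinations

def token_correlations(messages):
    """Return {(tok_a, tok_b): phi} for every token pair seen together at least once."""
    n = len(messages)
    single = Counter()  # number of messages containing each token
    pair = Counter()    # number of messages containing both tokens of a pair
    for tokens in messages:
        present = sorted(set(tokens))  # sorted so each pair has one canonical order
        single.update(present)
        pair.update(combinations(present, 2))
    result = {}
    for (a, b), n_ab in pair.items():
        n_a, n_b = single[a], single[b]
        denom = math.sqrt(n_a * (n - n_a) * n_b * (n - n_b))
        if denom:  # undefined when a token appears in every message
            result[(a, b)] = (n * n_ab - n_a * n_b) / denom
    return result

messages = [
    ["chicago", "illinois", "viagra"],
    ["chicago", "illinois"],
    ["viagra"],
    ["hello"],
]
phi = token_correlations(messages)
```

In this toy corpus "chicago" and "illinois" always appear together, so
their phi is 1.0, while "chicago" and "viagra" come out uncorrelated. Pairs
that never co-occur are simply not stored, so strongly negative
correlations are not visible in this sketch.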

It would appear that this is an O(n*n) problem in the number of distinct
tokens, since to accurately decide the correlation between any two tokens we
have to consider how each token correlates with all the others.  The problem
size can probably be reduced in various ways to avoid performing a full
pairwise comparison.
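
One possible way to cut the problem down, sketched under the assumption
that rare tokens can be ignored because they can't tell us much about
correlation anyway. The two-pass approach, the function name, and the
min_count threshold are all illustrative guesses rather than tuned values:

```python
from collections import Counter
from itertools import combinations

def pruned_pair_counts(messages, min_count=2):
    """Count co-occurrences only among tokens seen in at least min_count messages."""
    # Pass 1: per-token message counts, linear in the total number of tokens.
    doc_freq = Counter()
    for tokens in messages:
        doc_freq.update(set(tokens))
    # Pass 2: enumerate pairs within each message, skipping rare tokens.
    # The cost is quadratic in message length, not in vocabulary size.
    pairs = Counter()
    for tokens in messages:
        kept = sorted(t for t in set(tokens) if doc_freq[t] >= min_count)
        pairs.update(combinations(kept, 2))
    return doc_freq, pairs

sample = [
    ["chicago", "illinois", "hello"],
    ["chicago", "illinois"],
    ["chicago"],
]
freq, counts = pruned_pair_counts(sample)
```

Here "hello" appears in only one message, so no pair involving it is ever
counted, and the only pair stored is ("chicago", "illinois").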

Skip