[Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Sun Dec 14 15:58:17 EST 2003


[Bill Yerazunis]
> Do you have a good writeup of chi-combining online?
>
> I still haven't quite managed to wrap my head around it from what
> I've read.

Gary Robinson wrote a good article about it in the Linux Journal:

    http://www.linuxjournal.com/article.php?sid=6467

I can't guess whether it would help you without a concept of Unsure
messages, though.  A dramatic effect of chi-combining was transforming the
"spectacular failures" under Bayesian combining into solid Unsures.  For
example, when there's a ton of evidence in both directions, Bayesian
combining tends toward an absurd level of confidence in its guess,
regardless of whether that guess is right or wrong, while chi-combining
reliably returns a score near 0.50 in such cases.

A mental hangup at first may be that chi-combining doesn't even pretend to
compute the probability that a msg is spam, and only confusion can come from
supposing that it does.  It produces "a score", with no intuitive meaning
other than that "near 1.0 means spam, near 0.0 means ham, and close to 0.5
means the evidence is so mixed that the system can't guess".
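If it helps to see the shape of the computation, here's a minimal sketch of
chi-squared combining along the lines Robinson describes.  The names here
(chi2Q, chi_combine) and the exact interface are illustrative only, not the
actual Spambayes code:

    from math import exp, log

    def chi2Q(x2, df):
        # Survival function of the chi-squared distribution for an even
        # number of degrees of freedom, via the usual series expansion.
        m = x2 / 2.0
        term = total = exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def chi_combine(probs):
        # probs: per-token spam probabilities, each strictly between 0 and 1.
        n = len(probs)
        # S is pushed toward 1.0 when the clues, taken together, are very
        # unlikely under the "this is ham" hypothesis ...
        S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
        # ... and H toward 1.0 when they're very unlikely under "this is spam".
        H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
        # Near 1.0 means spam, near 0.0 means ham; strong but conflicting
        # evidence pushes S and H both toward 1.0, so the score lands near 0.5.
        return (S - H + 1.0) / 2.0

Feed it a pile of 0.99s and the score heads for 1.0; a pile of 0.01s and it
heads for 0.0; an even mix of both and it sits right at 0.5 -- which is
exactly the behavior that turns the old spectacular failures into Unsures.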

> ...
> Here's the quick description...
>
> Right now, the way CRM114 calculates local probabilities for each
> polynomial on each window is biased strongly toward 0.5.  The
> previous formula that worked best was:
>
>               hits_this_corpus - (hits_all_corpus - hits_this_corpus)
>       0.5 + ---------------------------------------------------------
>                         16 * ( hits_all_corpus + 1)
>
>
> which as you can see, needs a LOT of corpus hits on a feature to get
> much certainty out of it.

> For a corpus hapax that recurs in the test message, the local
> probability would be 0.5 + 1/32 = .53125
>
> With 16 hits on it, all spam, the local probability would be 0.5 +
> 16/(16*17) = .5588
>
> ...um... that's not right.

Well, it follows from the equation <wink>.  If there are N hits, all spam,
it's

    0.5 + N/(16*(N+1)) = 0.5 + N/(N+1) * 1/16

which approaches 0.5 + 1/16 = 0.5625 from below as N approaches infinity.
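Spelled out numerically (a throwaway check, assuming all N hits land in the
spam corpus, so hits_this_corpus == hits_all_corpus == N):

    def local_prob(n):
        # 0.5 + n/(16*(n+1)), per the formula quoted above
        return 0.5 + n / (16.0 * (n + 1))

    for n in (1, 16, 100, 10000):
        print(n, local_prob(n))
    # 1 -> 0.53125, 16 -> 0.5588..., 100 -> 0.5619..., 10000 -> 0.56249...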

> and now that I'm staring at it... I'm wondering how this could
> ever have worked,

Keeping spamprobs very mild would help Bayesian combining avoid "spectacular
failures", and that may have played into it.  I have to agree it doesn't
seem right anyway, though.

> what I was thinking when I wrote it,

You're on your own there <wink>.

> and the shocking realization that I need to spend a LOT more time
> checking my code...

Or figuring out why it works so well despite parts being crazy -- that
sometimes turns up unexpected insight!  A coworker and I recently spent a
lot of time simulating various schemes for implementing a second-level
client cache, based on running captured traces of cache requests from live
installations.  One particular relatively unsophisticated approach
consistently gave dramatically better results than others, darned near
unbelievably better.  After staring at it for more than a week (off & on),
it turned out that this scheme won huge *mostly* because it favored evicting
larger objects over evicting smaller objects, which left it with a lot more
objects sitting in the fixed-size cache, thus enormously boosting the hit
rate.  So the single most important factor feeding into the hit rate was one
we weren't even aware of at first, and the fact that one particular
algorithm happened to pay some attention to it was more accident than
design.
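
For the flavor of it, here's a toy version of that kind of policy -- nothing
like the real simulator, just the idea that evicting the biggest resident
object first leaves more objects cached (class and method names are made up
for illustration):

    class LargestFirstCache:
        # Fixed-capacity cache that evicts the largest resident object
        # when it needs room.  Many small objects survive, so the object
        # hit rate goes up even though total bytes cached stays the same.
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.sizes = {}              # key -> object size in bytes

        def get(self, key):
            return key in self.sizes     # True on a hit

        def put(self, key, size):
            if size > self.capacity:
                return                   # too big to cache at all
            self.used -= self.sizes.pop(key, 0)
            while self.used + size > self.capacity:
                victim = max(self.sizes, key=self.sizes.get)
                self.used -= self.sizes.pop(victim)
            self.sizes[key] = size
            self.used += size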



