[Spambayes] How low can you go?

Tue Dec 16 01:05:04 EST 2003

[Skip Montanaro]
>> Does it include Gary's scoring change?

[Tony Meyer]
> I wasn't paying enough attention to the earlier messages: is this the
> change that means that only the strongest of the two unigrams and one
> bigram is used?  If so, then yes, it includes that.

I see that it's a cruder approximation to the suggested scoring algorithm
(which I implemented at one time).  For example, the 3-word message:

    human growth hormone

generates three sets with three tokens each:

    human, "human growth", growth
    growth, "growth hormone", hormone
    hormone, "hormone human", human    # this is an artifact of wrap-around

and the strongest token is taken from each set.  The result isn't
necessarily a tiling of the original; for example, "growth" might win in
each of the first two sets, and "hormone" in the last set, leaving "human"
out of the scored part entirely.  Probably worse, it can score more than one
systematically correlated token, such as if "growth" wins from the first set
and "growth hormone" from the second set, and "hormone" from the third set
(then we end up scoring the bigram and both its constituent words).
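
To make the failure mode concrete, here is a minimal sketch of the "strongest token per set" scheme described above. It is not the Spambayes code: the function names, the `probs` dictionary, and the spam probabilities in it are all made up for illustration, and "strongest" is assumed to mean "farthest from 0.5".

```python
def wraparound_sets(words):
    """Build one (unigram, bigram, unigram) set per word, wrapping at the end."""
    n = len(words)
    sets = []
    for i in range(n):
        nxt = words[(i + 1) % n]  # the last word pairs with the first (wrap-around)
        sets.append((words[i], words[i] + " " + nxt, nxt))
    return sets


def strength(prob):
    """Assumed strength measure: distance of a spam probability from 0.5."""
    return abs(prob - 0.5)


def strongest_per_set(words, probs, unknown=0.5):
    """Pick the strongest token from each set; unseen tokens count as neutral."""
    return [
        max(group, key=lambda tok: strength(probs.get(tok, unknown)))
        for group in wraparound_sets(words)
    ]


# Hypothetical spam probabilities, chosen only to show the correlated case.
words = ["human", "growth", "hormone"]
probs = {
    "human": 0.5,
    "growth": 0.9,
    "hormone": 0.8,
    "human growth": 0.6,
    "growth hormone": 0.99,
    "hormone human": 0.5,
}

picked = strongest_per_set(words, probs)
# picked == ["growth", "growth hormone", "hormone"]
```

With these invented numbers the scheme scores the bigram and both of its constituent words, and "human" never reaches the score at all.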

A tiling would pick one of these three final outcomes:

    human, growth, hormone
    "human growth", hormone
    human, "growth hormone"

Then every token contributes to the score, but no pair of systematically
correlated tokens contributes to the score.
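
A rough sketch of what enumerating tilings could look like follows. The names are hypothetical, and the rule used to pick among tilings (largest total distance from 0.5) is only an assumption for illustration; nothing above specifies how the winning tiling should be chosen.

```python
def tilings(words):
    """Return every way to cover `words` with adjacent unigrams and bigrams."""
    if not words:
        return [[]]
    # Option 1: the first word stands alone as a unigram.
    result = [[words[0]] + rest for rest in tilings(words[1:])]
    # Option 2: the first two words are consumed together as a bigram.
    if len(words) >= 2:
        pair = words[0] + " " + words[1]
        result += [[pair] + rest for rest in tilings(words[2:])]
    return result


def best_tiling(words, probs, unknown=0.5):
    """Assumed selection rule: maximize the summed distance from 0.5."""
    def total_strength(tiling):
        return sum(abs(probs.get(tok, unknown) - 0.5) for tok in tiling)
    return max(tilings(words), key=total_strength)


words = ["human", "growth", "hormone"]
all_tilings = tilings(words)
# all_tilings == [["human", "growth", "hormone"],
#                 ["human", "growth hormone"],
#                 ["human growth", "hormone"]]

# Hypothetical spam probabilities, for illustration only.
probs = {
    "human": 0.5,
    "growth": 0.9,
    "hormone": 0.8,
    "human growth": 0.6,
    "growth hormone": 0.99,
}
chosen = best_tiling(words, probs)
# chosen == ["human", "growth", "hormone"]
```

There is no wrap-around here, each word is covered exactly once, and no tiling contains more tokens than the message has words.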

It's harder to code a tiling method; the advantage is that tiling doesn't
have systematic flaws.  It will nevertheless be interesting to see how this
other gimmick works (as explained before, the danger in allowing
systematically correlated tokens to feed into scoring is an increase in
"spectacular failures").

BTW, it should *not* be necessary to increase max_discriminators, and doing
so can create subtle numeric problems in the inverse chi-squared function.
Without this option, in an N-token message, N tokens were candidates for
scoring; with this option, there are still exactly N candidates for scoring;
with a true tiling implementation, there are no more than N candidates for
scoring (and usually fewer than N).