[Spambayes] Perhaps a level header would be useful?

Tim Peters tim.one at comcast.net
Tue Mar 11 20:33:25 EST 2003


[Meyer, Tony]
> ...
> The change I made was to replace line 245 ("prob = (S-H + 1.0) /
> 2.0") of classifier.py with:
> """
>             from math import log
>             if H == 0:
>                 H = 0.00000001
>             if S == 0:
>                 S = 0.00000001
>             prob = ((-(log(S) - log(H)))/350) + 0.5
> """

Apart from the technical glitches you bumped into, there's a reason we don't
want to combine H and S via any expression of this form.  Because the
difference of logs is the log of the quotient, and the negation of a log is
the log of the reciprocal, the heart of this expression is log(H/S), and
it's the H/S part that's undesirable.
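
For anyone who wants to check the algebra, here's a minimal sketch.  The function names quoted_score and ratio_score are made up for illustration and aren't in classifier.py; the 350 and 0.5 constants are copied from the quoted change.

```python
from math import log, isclose

def quoted_score(H, S):
    # The expression from the quoted change, written as given.
    return (-(log(S) - log(H))) / 350 + 0.5

def ratio_score(H, S):
    # The same thing rewritten: -(log S - log H) == log(H/S).
    return log(H / S) / 350 + 0.5

# Arbitrary illustrative inputs; any positive H and S behave the same way.
print(quoted_score(0.3, 0.2), ratio_score(0.3, 0.2))
```

The two agree up to floating-point rounding, so the score depends on H and S only through the ratio H/S.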

If, say, H is 0.99, and S is 0.0099, H/S is 100 and there's no problem with
concluding that we're sure the msg is ham.

But suppose H is .0001 and S is .000001.  Then H/S is also 100, but it's
plain nuts to be exactly as sure that the msg is ham:  H on its own says the
system thinks there's virtually no chance the msg looks like what it's been
taught about ham, and the low S says the same about what it's been taught
about spam:  it doesn't look like either, so Unsure is the "proper"
response.  If the system *had* to guess one or the other, then ham is the
best guess it can make, but H on its own says the system doesn't believe
that guess.  (Note that in pH calculations, small magnitudes don't "say"
anything significant -- a factor of 100 is equally significant in that domain
no matter how small the input magnitudes.)
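
A quick sketch of the two cases above, assuming the ratio enters the score only through log(H/S).  The helper name log_ratio is hypothetical, used here just to make the point.

```python
from math import log, isclose

def log_ratio(H, S):
    # Heart of the ratio-style expression: it sees only H/S, not the
    # magnitudes of H and S themselves.
    return log(H / S)

confident = log_ratio(0.99, 0.0099)       # H says "looks like ham"
clueless = log_ratio(0.0001, 0.000001)    # H and S both say "looks like neither"

# Both ratios are 100, so the two values match up to rounding.
print(confident, clueless, isclose(confident, clueless))
```

Any score built from log(H/S) alone can't tell these two msgs apart, which is the whole problem.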

Rob Hooft crafted the simple combining formula we use to give a high
combined score in the first example and a solid Unsure in the second
example.  We used a different expression involving a ratio before that, and
examples of the second kind are exactly where it screwed up.  Don't want to
do that again <wink>.
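
To see the difference on the same two examples, here's a sketch using the expression from the line the quoted change replaced, prob = (S-H + 1.0) / 2.0.  The function name combine is invented for illustration, and reading the result as "near 0 is ham, near 1 is spam, near 0.5 is Unsure" is my gloss on that formula.

```python
def combine(H, S):
    # The combining expression quoted at the top of this msg as the
    # original line of classifier.py.
    return (S - H + 1.0) / 2.0

first = combine(0.99, 0.0099)        # about 0.00995: far from 0.5, confident ham
second = combine(0.0001, 0.000001)   # about 0.49995: sitting right at Unsure

print(first, second)
```

Because this uses the difference S-H rather than a ratio, two tiny inputs nearly cancel and the result lands in the middle, no matter how large H/S is.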

BTW, and IIRC, cmp.py never got updated to deal sensibly with unsures.  If
that's right, it shouldn't be used except when spam_cutoff == ham_cutoff.
Then you've got a two-outcome classifier (no unsures), and cmp.py won't
"forget" any msgs.



