[Spambayes] Proposing to remove 4 combining schemes

Tim Peters tim.one@comcast.net
Thu Oct 17 15:58:31 2002


[Tim, suggests to remove use_z_combining]
>>

[Rob Hooft]
> I guess that means that no RMS magic can help here. Go ahead.

I really don't know, but I don't *see* a way.  It's the normIP() results
that are assumed to be unit-normal, and that happens iff the input probs are
uniformly distributed.  But the deviation of the latter from uniformity
doesn't have any bad consequence I can detect -- to the contrary, if
anything, it seems to make the ham-vs-spam decision easier.

[on the 3 remaining schemes]
>> Indeed, it's possible (no experiments have been done on this) that
>> a "hard" msg for one scheme could benefit via getting scored again
>> by one or both of the others.

> I don't expect a lot from that. You and I at least have repeatedly seen
> the same fp and fn's across methods.

The same final decision, yes, but in at least my cases the *relative* scores
across schemes are quite different.  For example, even my worst FP, which
scores nearly 1.0000000000000 under chi-combining, doesn't have a
particularly high score under Gary-combining *when compared against* the
universe of genuine-spam scores under Gary-combining.  The few clues that
this FP was posted by a real person count a lot under the latter.  Not
enough to drag it into ham territory (and nothing ever will do that), and
not even enough to drag it into what could reasonably be called a middle ground
for Gary-combining, but still below the mean for Gary-combining spam scores.
The same is true of my other deadly-bad FP under chi-combining, but even
more so.

I expect the same is true of Alex's data, because his first reaction when
trying the more-extreme tim-combining (but far less extreme than chi-) was
despair over how much *more* extreme his FP got.  I assume they score 1.0
under chi-combining.

So the idea to try here (which remains untested) would be to broaden chi's
middle ground via thinking twice when Gary-combining is much less sure of a
msg.  This needs precise fleshing out before it can be tested, though.

Note that the 3 remaining schemes all compute products of probs and of
1-probs, and the loopy bit doing that is the expensive part of scoring.
Getting the 3 final measures out of that is really cheap.
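
A minimal sketch of that point, not the project's actual code.  It assumes
every clue probability lies strictly between 0 and 1.  The tim- and chi-
formulas follow the S/(S+H) and (S-H+1)/2 forms mentioned in this thread, with
2*n degrees of freedom inferred from the "mean is 2*n" remark.  The
Gary-combining formula and the body of chi2Q() are written from memory and
should be read as assumptions; combine_all() is a hypothetical name.

```python
import math

def chi2Q(x2, v):
    """Upper-tail probability of chi-squared with even v degrees of freedom.
    (Assumed implementation: closed-form series, valid only for even v.)"""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against roundoff pushing the sum past 1

def combine_all(probs):
    """Return Gary-, tim- and chi-combining scores from one pass over probs."""
    n = len(probs)
    if n == 0:
        return {"gary": 0.5, "tim": 0.5, "chi": 0.5}

    # The expensive, loopy part: accumulate the two products.  Logs are
    # summed instead of multiplying directly, to avoid underflow.
    ln_prod_p = 0.0   # ln of product of probs
    ln_prod_q = 0.0   # ln of product of (1 - prob)
    for p in probs:
        ln_prod_p += math.log(p)
        ln_prod_q += math.log(1.0 - p)

    # Everything below is cheap arithmetic on those two numbers.
    geo_p = math.exp(ln_prod_p / n)   # geometric mean of probs
    geo_q = math.exp(ln_prod_q / n)   # geometric mean of 1-probs

    # tim-combining: S and H are the geometric means themselves.
    tim = geo_p / (geo_p + geo_q)

    # Gary-combining (assumed form): P and Q are 1 minus the geometric means.
    P = 1.0 - geo_q
    Q = 1.0 - geo_p
    gary = (1.0 + (P - Q) / (P + Q)) / 2.0

    # chi-combining: -2*ln(product) is compared against chi-squared(2n).
    S = 1.0 - chi2Q(-2.0 * ln_prod_q, 2 * n)
    H = 1.0 - chi2Q(-2.0 * ln_prod_p, 2 * n)
    chi = (S - H + 1.0) / 2.0

    return {"gary": gary, "tim": tim, "chi": chi}

scores = combine_all([0.99] * 20 + [0.2] * 2)
print(scores)
```

With mostly spammy clues plus a couple of hammy ones, the three scores come
out ordered gary < tim < chi, which matches the "more extreme" ordering
described above.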

[on extreme vs non-extreme]
> But it is unrealistic. Think about the original problem again: "why
> can't software that classifies ham/spam be very easy? Almost all spam's
> scream in your face that they are". With chi_squared combining we found
> a method that agrees with this. Most messages scream either "Ham" or
> "Spam", and there is very little left to doubt.

It could be that the UI would be better off with a "ham", "spam", "unsure"
string tag than with decimal digits of precision.

> You can downscale things a bit by reducing the final S,H-score in
> chi_squared combining before calling chi2Q. Maybe take the sqrt or
> something similar.

Not really attractive; sqrt would be far too gross a distortion, btw (e.g.,
it would change a score of 0.5 to 0.0 -- the mean is 2*n and the sdev
2*sqrt(n)).
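
One reading of that remark, sketched numerically: a statistic sitting at the
chi-squared mean 2*n gives 1-chi2Q near 0.5, while its square root falls so
many standard deviations below the mean that 1-chi2Q collapses to 0.  The
chi2Q() body below is an assumed implementation for even degrees of freedom,
and n = 50 is an arbitrary illustrative choice.

```python
import math

def chi2Q(x2, v):
    """Upper-tail probability of chi-squared with even v degrees of freedom.
    (Assumed implementation, valid only for even v.)"""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

n = 50                      # number of clues (illustrative)
dof = 2 * n                 # degrees of freedom
mean = 2 * n                # chi-squared mean equals its degrees of freedom
sdev = 2 * math.sqrt(n)     # sqrt of the variance 2*dof = 4*n

at_mean = 1.0 - chi2Q(mean, dof)             # statistic left alone
after_sqrt = 1.0 - chi2Q(math.sqrt(mean), dof)  # statistic squashed by sqrt

print("sdevs below the mean after sqrt:", (mean - math.sqrt(mean)) / sdev)
print("1 - chi2Q at the mean:", at_mean)
print("1 - chi2Q after sqrt: ", after_sqrt)
```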

> ...
> Maybe the better answer is that the final UI shouldn't throw the scores
> in your face.

Possibly.  For now it's helpful to me, since I'm a developer and really need
a window on the internals.

> ...
> Did you ever try tim combining with (S-H+1)/2?

No, but it would be an excellent idea to try it with the current default
combining!  tim-combining is unique in that its S is especially sensitive to
*low*-spamprob words, and its H to high-spamprob words; when something
really is spam, tim-combining isn't relying so much on having a high S value
as on having a low H value, so that the ratio S/(S+H) approaches 1.
Gary-combining is much more like chi-combining in these respects, and
chi-combining is where the (S-H+1)/2 reformulation helped.
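
To make the contrast concrete, here is a small sketch comparing the two final
formulas on the same pair of values.  The numbers S = 0.6 and H = 0.01 are
made up for illustration (a middling S with a very low H, the situation
described above for real spam under tim-combining); the function names are
hypothetical.

```python
def ratio_score(S, H):
    """The S/(S+H) form used by tim-combining."""
    return S / (S + H)

def difference_score(S, H):
    """The (S-H+1)/2 reformulation that helped chi-combining."""
    return (S - H + 1.0) / 2.0

S, H = 0.6, 0.01   # illustrative values only, not measured data

print("S/(S+H)   =", ratio_score(S, H))
print("(S-H+1)/2 =", difference_score(S, H))
```

With a low H the ratio form heads toward 1 almost regardless of S, while the
difference form stays well short of 1 unless S itself is high.  That is only
arithmetic on invented inputs; whether the reformulation helps or hurts on
real data is untested.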