[Spambayes] Proposing to remove 4 combining schemes
Tim Peters
tim.one@comcast.net
Thu Oct 17 15:58:31 2002
[Tim, suggests to remove use_z_combining]
>>
[Rob Hooft]
> I guess that means that no RMS magic can help here. Go ahead.
I really don't know, but I don't *see* a way. It's the normIP() results
that are assumed to be unit-normal, and that happens iff the input probs are
uniformly distributed. But the deviation of the latter from uniformity
doesn't have any bad consequence I can detect -- to the contrary, if
anything, it seems to make the ham-vs-spam decision easier.
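To make the uniformity point concrete, here is a minimal sketch. It assumes normIP() is the inverse of the standard normal CDF, and uses Python's statistics.NormalDist().inv_cdf as a stand-in for it. The names norm_ip, z_stats and extreme_prob are made up for illustration, and the "extreme" input distribution is an invented caricature of real spamprobs, not measured data.

```python
import random
from statistics import NormalDist, mean, pstdev


def norm_ip(p):
    """Inverse standard normal CDF; a stand-in for normIP() (assumption)."""
    return NormalDist().inv_cdf(p)


def z_stats(probs):
    """Mean and standard deviation of the inverse-normal transformed probs."""
    zs = [norm_ip(p) for p in probs]
    return mean(zs), pstdev(zs)


rng = random.Random(2002)  # fixed seed so the run is repeatable

# Case 1: probs uniformly distributed on (0, 1), kept off the exact endpoints.
uniform_probs = [rng.uniform(1e-9, 1.0 - 1e-9) for _ in range(20000)]


def extreme_prob():
    """A made-up prob that hugs 0 or 1, the way strong clues tend to."""
    p = rng.uniform(0.001, 0.05)
    return p if rng.random() < 0.5 else 1.0 - p


# Case 2: probs piled up near the two ends of the interval.
extreme_probs = [extreme_prob() for _ in range(20000)]

uniform_mean, uniform_sd = z_stats(uniform_probs)
extreme_mean, extreme_sd = z_stats(extreme_probs)

print("uniform input: mean %.3f sdev %.3f" % (uniform_mean, uniform_sd))
print("extreme input: mean %.3f sdev %.3f" % (extreme_mean, extreme_sd))
```

The uniform input should come out close to mean 0 and sdev 1, while the extreme input comes out much wider, which is the sense in which the unit-normal assumption fails for non-uniform probs.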
[on the 3 remaining schemes]
>> Indeed, it's possible (no experiments have been done on this) that
>> a "hard" msg for one scheme could benefit via getting scored again
>> by one or both of the others.
> I don't expect a lot from that. You and I at least have repeatedly seen
> the same fp and fn's across methods.
The same final decision, yes, but in at least my cases the *relative* scores
across schemes are quite different. For example, even my worst FP, which
scores nearly 1.0000000000000 under chi-combining, doesn't have a
particularly high score under Gary-combining *when compared against* the
universe of genuine-spam scores under Gary-combining. The few clues that
this FP was posted by a real person count a lot under the latter.  Not
enough to drag it into ham territory (and nothing ever will do that), and
not even enough to drag it into what could reasonably be called a middle ground
for Gary-combining, but still below the mean for Gary-combining spam scores.
The same is true of my other deadly-bad FP under chi-combining, but even
more so.
I expect the same is true of Alex's data, because his first reaction when
trying the more-extreme tim-combining (but far less extreme than chi-) was
despair over how much *more* extreme his FPs got.  I assume they score 1.0
under chi-combining.
So the idea to try here (which remains untested) would be to broaden chi's
middle ground via thinking twice when Gary-combining is much less sure of a
msg. This needs precise fleshing out before it can be tested, though.
Note that the 3 remaining schemes all compute products of probs and of
1-prods, and the loopy bit doing that is the expensive part of scoring.
Getting the 3 final measures out of that is really cheap.
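A rough sketch of that sharing, under stated assumptions: the formulas below for Gary-, tim- and chi-combining are written from the descriptions used in these threads and are not copied from the classifier code. The helper names (log_products, gary_score, tim_score, chi_score) are hypothetical, and the probs are assumed to lie strictly between 0 and 1. Sums of logs stand in for the raw products to dodge underflow.

```python
import math


def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against roundoff pushing the sum over 1


def log_products(probs):
    """The expensive loop: ln(prod(p)) and ln(prod(1-p)), computed once."""
    ln_p = sum(math.log(p) for p in probs)
    ln_q = sum(math.log(1.0 - p) for p in probs)
    return ln_p, ln_q, len(probs)


def gary_score(ln_p, ln_q, n):
    """Gary-combining (assumed form): geometric means, then (P-Q)/(P+Q)."""
    P = 1.0 - math.exp(ln_q / n)
    Q = 1.0 - math.exp(ln_p / n)
    return (1.0 + (P - Q) / (P + Q)) / 2.0


def tim_score(ln_p, ln_q, n):
    """tim-combining (assumed form): S/(S+H) on the geometric means."""
    S = math.exp(ln_p / n)
    H = math.exp(ln_q / n)
    return S / (S + H)


def chi_score(ln_p, ln_q, n):
    """chi-combining (assumed form): (S-H+1)/2 from two chi-squared tests."""
    S = 1.0 - chi2Q(-2.0 * ln_q, 2 * n)
    H = 1.0 - chi2Q(-2.0 * ln_p, 2 * n)
    return (S - H + 1.0) / 2.0


def all_scores(probs):
    """One pass over the probs, then three cheap final measures."""
    ln_p, ln_q, n = log_products(probs)
    return (gary_score(ln_p, ln_q, n),
            tim_score(ln_p, ln_q, n),
            chi_score(ln_p, ln_q, n))


# Mostly spammy clues plus a couple of strong ham clues (invented values).
print(all_scores([0.99] * 8 + [0.01] * 2))
```

Only log_products() loops over the clues; each of the three scoring functions is a handful of arithmetic operations on its results (chi2Q adds a short loop of n terms).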
[on extreme vs non-extreme]
> But it is unrealistic. Think about the original problem again: "why
> can't software that classifies ham/spam be very easy? Almost all spam's
> scream in your face that they are". With chi_squared combining we found
> a method that agrees with this. Most messages scream either "Ham" or
> "Spam", and there is very little left to doubt.
It could be that the UI would be better off with a "ham", "spam", "unsure"
string tag than with decimal digits of precision.
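Something like the following would do for the tagging. This is only a sketch: the function name tag and both cutoff values are placeholders picked for illustration, not tested settings.

```python
def tag(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a 0..1 score to a coarse label; the cutoffs are placeholders."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


for score in (0.003, 0.55, 0.9999):
    print(score, tag(score))
```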
> You can downscale things a bit by reducing the final S,H-score in
> chi_squared combining before calling chi2Q. Maybe take the sqrt or
> something similar.
Not really attractive; sqrt would be far too gross a distortion, btw (e.g.,
it would change a score of 0.5 to 0.0 -- with n words the statistic has
2*n degrees of freedom, so its mean is 2*n and its sdev 2*sqrt(n)).
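A small numeric check of that claim, as a sketch only: chi2Q here is the usual even-degrees-of-freedom series for the chi-squared tail, written out so the block stands alone, and n = 50 is an arbitrary choice. A statistic sitting right at its mean gives 1 - chi2Q() near 0.5. Taking the sqrt first lands it many sdevs below the mean, and the result collapses to about 0.

```python
import math


def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


n = 50                       # number of clues (arbitrary for this example)
dof = 2 * n                  # degrees of freedom
stat = float(dof)            # a statistic sitting exactly at the mean, 2*n
sdev = 2.0 * math.sqrt(n)    # sdev of the statistic

before = 1.0 - chi2Q(stat, dof)             # middling, near 0.5
after = 1.0 - chi2Q(math.sqrt(stat), dof)   # sqrt applied first

# How far below the mean the sqrt'ed statistic lands, in sdevs.
sdevs_below = (stat - math.sqrt(stat)) / sdev

print("before sqrt: %.4f" % before)
print("after sqrt:  %.3g" % after)
print("sqrt'ed statistic is %.1f sdevs below the mean" % sdevs_below)
```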
> ...
> Maybe the better answer is that the final UI shouldn't throw the scores
> in your face.
Possibly. For now it's helpful to me, since I'm a developer and really need
a window on the internals.
> ...
> Did you ever try tim combining with (S-H+1)/2?
No, but it would be an excellent idea to try it with the current default
combining! tim-combining is unique in that its S is especially sensitive to
*low*-spamprob words, and its H to high-spamprob words; when something
really is spam, tim-combining isn't relying so much on having a high S value
as on having a low H value, so that the ratio S/(S+H) approaches 1.
Gary-combining is much more like chi-combining in these respects, and
chi-combining is where the (S-H+1)/2 reformulation helped.
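To see why the two final formulas behave so differently, here is a tiny sketch comparing them on the same (S, H) pair. The function names and the sample values are invented for illustration.

```python
def ratio_score(S, H):
    """The S/(S+H) form used by tim-combining."""
    return S / (S + H)


def diff_score(S, H):
    """The (S-H+1)/2 reformulation."""
    return (S - H + 1.0) / 2.0


# A so-so S paired with a very low H (made-up values).
S, H = 0.4, 0.02
print("ratio: %.3f" % ratio_score(S, H))
print("diff:  %.3f" % diff_score(S, H))
```

With a low H the ratio form is already close to 1 even though S is unimpressive, while the difference form stays well short of 1 unless S itself is high.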