[Spambayes] Chi**2 results
Rob Hooft
rob@hooft.net
Sun, 13 Oct 2002 09:00:24 +0200
Tim Peters wrote:
> Looks good too <wink>. One part is *too* good:
>
> -> <stat> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03
> -> <stat> min -2.22045e-13; median 9.99201e-14; max 100
> ^^^^^^^^^^^^^^^^
I noticed that... But indeed, one cannot blame the program if it is
calculating chi2Q with 14-digit accuracy and then subtracting it from 1.0.....
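To illustrate the rounding effect, here is a minimal sketch. It assumes chi2Q is evaluated with the standard closed-form series for an even number of degrees of freedom; the function name chi2q_even and the inputs are hypothetical and not taken from the Spambayes code. When the true tail probability is indistinguishable from 1, the difference 1.0 - Q consists only of rounding noise, and that noise can have either sign.

```python
import math

def chi2q_even(x, dof):
    """Upper-tail chi-squared probability for an even number of degrees of freedom.

    Uses Q(x, 2n) = exp(-x/2) * sum_{i=0}^{n-1} (x/2)**i / i!
    """
    assert dof > 0 and dof % 2 == 0
    m = x / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return total  # deliberately not clamped to 1.0, so rounding stays visible

# Hypothetical input: a tiny statistic with many degrees of freedom,
# so the true Q differs from 1 by far less than double precision can show.
q = chi2q_even(0.5, 100)
residual = 1.0 - q          # only rounding noise is left
scaled = 100.0 * residual   # on the 0..100 scale used in the <stat> lines

print(repr(q), repr(residual), repr(scaled))
```

On the 0..100 scale the residual stays many orders of magnitude below anything that matters for classification, which is the same order of effect as the min and median values quoted above.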
> I don't think any scheme can afford to throw msgs away entirely.
I have to admit that I do have a "spam" folder from SpamAssassin (SA) at this moment,
and that I am only "scanning" the index page of this for 3 seconds per
week.... That is almost as good as throwing them out completely.
A good feature of SpamAssassin is that it turns every suspect message
into text/plain. This would be a good feature for the middle-ground
messages (but it should be easy to undo somehow for
middle-ground-negatives).
> So it's at best a mixed bag. I don't know of a computationally cheap way to
> take correlations into account, else I would have tried that before
> resorting to stripping HTML tags (I hate throwing info away).
We'd just have to make a 100k*100k correlation matrix. Programmatically
very cheap ;-)
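A back-of-the-envelope sketch of why that matrix is cheap to program but not to compute or store. The 8 bytes per entry (double precision, dense storage) is an assumption; the 100k token count is the figure from the line above.

```python
n_tokens = 100_000            # vocabulary size from the text
bytes_per_entry = 8           # assumption: one double per correlation

entries = n_tokens * n_tokens                 # full dense matrix
unique_pairs = n_tokens * (n_tokens - 1) // 2 # symmetric, without the diagonal

dense_gib = entries * bytes_per_entry / 2**30
packed_gib = unique_pairs * bytes_per_entry / 2**30

print(f"{entries:,} entries, about {dense_gib:.1f} GiB dense")
print(f"{unique_pairs:,} unique pairs, about {packed_gib:.1f} GiB packed")
```

Even exploiting symmetry only halves the storage, and every entry would still have to be estimated from the training data.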
I'm currently looking at the H and S values of middle-ground messages. I
have seen a few with H+S>1.9 so far. An advantage of the current scheme is
that if H+S>1.25, the message is always at least in the middle ground.
Cases with H+S<<1 are quite rare with this scheme, but I've seen some with
H=0.05 S=0.02 and will investigate whether something can be gained (sure
fp/fn) in that area.
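The H+S>1.25 observation can be checked with a small sketch. It assumes that H and S both lie in [0, 1] and that the final score is (S - H + 1) / 2; neither is stated in this message, and the helper name score_bounds is hypothetical. Under those assumptions, fixing the sum H+S limits how far the score can move away from 0.5.

```python
def score_bounds(hs_sum):
    """Range of (S - H + 1) / 2 when H + S == hs_sum and 0 <= H, S <= 1."""
    assert 0.0 <= hs_sum <= 2.0
    s_min = max(0.0, hs_sum - 1.0)   # smallest S that keeps H <= 1
    s_max = min(1.0, hs_sum)         # largest S that keeps H >= 0
    low = (2.0 * s_min - hs_sum + 1.0) / 2.0
    high = (2.0 * s_max - hs_sum + 1.0) / 2.0
    return low, high

def score(h, s):
    """Combined score under the assumed formula."""
    return (s - h + 1.0) / 2.0

print(score_bounds(1.25))   # both indicators fairly strong
print(score_bounds(1.9))    # both indicators very strong
print(score(0.05, 0.02))    # the H=0.05, S=0.02 case from the text
```

With a sum of 1.25 the score is confined to 0.125..0.875, and the interval only narrows as the sum grows, so such a message is sure or unsure depending on where the cutoffs sit relative to that interval. A very small sum also pins the score near 0.5, which is why the H=0.05 S=0.02 case lands in the middle as well.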
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/