[Spambayes] Chi**2 results

Rob Hooft rob@hooft.net
Sun, 13 Oct 2002 09:00:24 +0200


Tim Peters wrote:

> Looks good too <wink>.  One part is *too* good:
> 
> -> <stat> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03
> -> <stat> min -2.22045e-13; median 9.99201e-14; max 100
>           ^^^^^^^^^^^^^^^^

I noticed that... But indeed, one cannot blame the program if it 
calculates chi2Q to 14-digit accuracy and then subtracts it from 1.0.....
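
As a rough illustration of that cancellation, here is a minimal sketch. It assumes chi2Q is the upper-tail chi-squared probability for an even number of degrees of freedom, computed with the standard finite series; the function name chi2q and the inputs are hypothetical and not taken from the project code. When the true probability is indistinguishable from 1, subtracting it from 1.0 leaves only rounding noise, which can be zero or of either sign.

```python
import math

def chi2q(x2, v):
    """Upper-tail chi-squared probability for even degrees of freedom v,
    via the finite series exp(-m) * sum(m**i / i!) with m = x2 / 2."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return total

# Hypothetical inputs: a tiny statistic with many degrees of freedom,
# so the true tail probability equals 1 to far more than 16 digits.
q = chi2q(0.1, 100)
residue = 1.0 - q          # only rounding noise survives the subtraction
print(q, residue, 100.0 * residue)
```

The quoted statistics appear to be on a 0 to 100 scale, so noise of this kind would show up multiplied by 100. Clamping the result to the range 0 to 1 would hide the negative values without changing any classification.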

> I don't think any scheme can afford to throw msgs away entirely. 

I have to admit that I do have a "spam" folder from SA at this moment, 
and that I am only "scanning" the index page of this for 3 seconds per 
week.... That is almost as good as throwing them out completely.

A good feature of SpamAssassin is that it turns every suspect message 
into text/plain. This would be a good feature for the middle-ground 
messages (but it should be easy to undo somehow for 
middle-ground-negatives).

> So it's at best a mixed bag.  I don't know of a computationally cheap way to
> take correlations into account, else I would have tried that before
> resorting to stripping HTML tags (I hate throwing info away).

We'd just have to make a 100k*100k correlation matrix. Programmatically 
very cheap ;-)

I'm currently looking at the H and S values of middle-ground messages. I 
have seen a few with H+S>1.9 so far. An advantage of the current scheme is 
that if H+S>1.25, the message is always at least in the middle ground. 
H+S<<1 is quite rare with this scheme, but I've seen some with H=0.05 
S=0.02 and will investigate whether something can be gained (sure fp/fn) 
in that area.
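
The H+S>1.25 remark can be checked with a small sketch. It assumes the combined score is (S - H + 1) / 2 with H and S both between 0 and 1; that formula is not stated in this message, and the function names and grid resolution are hypothetical. Under that assumption, H+S>1.25 forces both values above 0.25, so the score stays strictly inside 0.125 to 0.875.

```python
def combined_score(h, s):
    # Assumed combination rule: 0.0 is sure ham, 1.0 is sure spam.
    return (s - h + 1.0) / 2.0

def score_range(min_sum, steps=200):
    """Scan a grid of (H, S) pairs with H + S > min_sum and return the
    lowest and highest combined score found."""
    lo, hi = 1.0, 0.0
    for i in range(steps + 1):
        h = i / steps
        for j in range(steps + 1):
            s = j / steps
            if h + s > min_sum:
                score = combined_score(h, s)
                lo = min(lo, score)
                hi = max(hi, score)
    return lo, hi

lo, hi = score_range(1.25)
print(lo, hi)
```

Such messages land in the middle ground whenever the ham cutoff is at or below 0.125 and the spam cutoff is at or above 0.875; with tighter cutoffs the guarantee would not hold.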

Rob
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/