[Spambayes] fwd: robinson f(w) equation: X constant confusion

Thu Nov 28 11:52:55 2002

hi folks --

just wondering about this.  I ran some tests which swept across a range
of X and S values (as used in Gary Robinson's f(w) equation), computed
a cost figure for each (X, S) pair on a corpus of 2000 spam vs. 1000 ham,
then graphed the results.
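
(For reference, a minimal sketch of the f(w) adjustment in Python. The
function and argument names are made up for illustration and are not taken
from the SpamAssassin or Spambayes code. p_w is the raw spam probability of
a word, n the number of messages it was seen in, x the probability assumed
for a word with no data, and s the strength given to that assumption.)

```python
def robinson_f(p_w, n, x=0.5, s=1.0):
    """Robinson's adjusted word probability: f(w) = (s*x + n*p(w)) / (s + n).

    Requires s > 0 so that an unseen word (n == 0) is well defined.
    """
    if s <= 0:
        raise ValueError("s must be positive")
    return (s * x + n * p_w) / (s + n)


# An unseen word falls back to x, whatever s is.
unseen = robinson_f(0.0, 0, x=0.32, s=0.05)

# A word seen once, only in spam: a small s barely pulls it toward x,
# a large s pulls it most of the way.
weak_prior = robinson_f(1.0, 1, x=0.53, s=0.05)
strong_prior = robinson_f(1.0, 1, x=0.53, s=5.0)
```

So X matters mostly for words with little data, and S controls how quickly
real counts override it.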

Results are here:

   http://spamassassin.taint.org/qa/s_x_gary.png

note that X=0.53 S=0.05 and X=0.69 S=0.32 seem to
give the best results.
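
(The sweep itself is just a grid search. A rough sketch, assuming a
cost_fn(x, s) callback that trains and scores the corpus for one (X, S)
pair and returns a cost where lower is better. The toy_cost function below
is a made-up stand-in so the example runs; it is not the cost figure used
for the graph.)

```python
def sweep(cost_fn, x_values, s_values):
    """Evaluate cost_fn on every (x, s) grid point.

    Returns the full list of (cost, x, s) tuples sorted by cost, so that
    several near-optimal regions can be inspected, not just the minimum.
    """
    results = [(cost_fn(x, s), x, s) for x in x_values for s in s_values]
    results.sort()
    return results


def toy_cost(x, s):
    # Hypothetical smooth cost surface with its minimum at x=0.5, s=0.1.
    return (x - 0.5) ** 2 + (s - 0.1) ** 2


x_grid = [i / 20 for i in range(1, 20)]   # 0.05 .. 0.95
s_grid = [i / 20 for i in range(1, 21)]   # 0.05 .. 1.00
ranked = sweep(toy_cost, x_grid, s_grid)
best_cost, best_x, best_s = ranked[0]
```

Keeping the whole ranked list rather than only the minimum makes it easier
to spot cases like the two separate good regions mentioned above.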

However, computing X from the corpus, as per Gary's webpage, results in a
value of 0.32.  But according to that graph, X=0.32 gives pretty poor
results ;)
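
(A sketch of how I understand the computed X: the average of the raw p(w)
over words with enough data, where p(w) = b(w) / (b(w) + g(w)) and b, g are
the fractions of spam and ham messages containing the word. The names, the
min_count threshold, the 0.5 fallback and the word counts below are all
assumptions for illustration, not real corpus data.)

```python
def estimate_x(word_counts, n_spam, n_ham, min_count=10):
    """Average raw p(w) over words seen in at least min_count messages.

    word_counts maps word -> (spam_msgs_containing, ham_msgs_containing).
    """
    probs = []
    for spam_n, ham_n in word_counts.values():
        if spam_n + ham_n < max(min_count, 1):
            continue  # too little data for this word
        b = spam_n / n_spam   # fraction of spam containing the word
        g = ham_n / n_ham     # fraction of ham containing the word
        probs.append(b / (b + g))
    # Assumed fallback: with no usable words, stay neutral at 0.5.
    return sum(probs) / len(probs) if probs else 0.5


# Made-up counts for a 2000 spam / 1000 ham corpus.
counts = {
    "viagra": (40, 0),
    "meeting": (0, 30),
    "the": (1900, 950),
    "rare": (1, 0),      # only one data point
}
x_thresholded = estimate_x(counts, 2000, 1000, min_count=10)
x_everything = estimate_x(counts, 2000, 1000, min_count=1)
```

With these toy counts, letting the single-occurrence word into the average
moves the estimate, which is the kind of sensitivity Allen describes below.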

As Allen says below:

> I can't see any reason why that would cause this - it's the same corpus
> giving a 0.32 result, after all. I'm thinking instead that, as per the above,
> the optimal robinson_x almost certainly _isn't_ a simple average of the
> p-values - especially not of p-values computed using the robinson
> equation in the first place, including ones that have fewer than 10 or so
> points of data each. Something to work on at some point...

Anyone thought about this?  How did you guys come up with your X and S
figures?

(BTW, the same graph for chi-squared combining is at
http://spamassassin.taint.org/qa/s_x_chi.png, if you're interested.  Note
that the optimal values seem to be quite different there!)
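
(For anyone unfamiliar with it, a minimal sketch of chi-squared combining in
the formulation I understand Spambayes to use; the SpamAssassin code may
differ in detail. The function names are illustrative, and the per-word
probabilities are assumed to lie strictly between 0 and 1.)

```python
import math


def chi2q(x2, df):
    """Survival function of the chi-squared distribution for even df."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-word f(w) values into one score in [0, 1].

    Near 1 means spam, near 0 means ham, near 0.5 means unsure.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    log_ham = sum(math.log(1.0 - p) for p in probs)
    log_spam = sum(math.log(p) for p in probs)
    spam_ind = 1.0 - chi2q(-2.0 * log_ham, 2 * n)
    ham_ind = 1.0 - chi2q(-2.0 * log_spam, 2 * n)
    return (spam_ind - ham_ind + 1.0) / 2.0


spammy = chi_combine([0.99] * 5)
hammy = chi_combine([0.01] * 5)
neutral = chi_combine([0.5, 0.5, 0.5])
```

Since the combiner reacts to the f(w) values differently from Gary's
combining, it seems plausible that the best (X, S) pair would differ too.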

--j.

------- Forwarded Message

Date:    Thu, 28 Nov 2002 00:45:28 -0500
From:    Ed Allen Smith <easmith@beatrice.rutgers.edu>
To:      jm@jmason.org
cc:      SpamAssassin-devel@lists.sourceforge.net
Subject: Re: [SAdev] bayes 10pcv results, pass 8

In message <20021127113958.52AA916F89@jmason.org>
(on 27 November 2002 11:39:53 +0000), jm@jmason.org (Justin Mason) wrote:
>
>Ed Allen Smith said:
>> >- 85.80 robx30
>> >   In other words, using the computed value for robinson_x as suggested
>> >   by Allen; 0.32 on my test corpus.  Didn't work :(  there was more
>> >   spillage of scores across the middle-ground.
>> 
>> Ah, well. Wait - 0.32? Fascinating. Was
>> http://spamassassin.taint.org/qa/s_x_gary.png run versus that corpus? It's
>> showing the optimal robinson_x being slightly _above_ 0.5, which may
>> indicate a different means of computing the optimal robinson_x than the
>> current method.
>
>yep, it's all run on the same corpus.

Huh. 

>Strange, isn't it?
>
>Maybe it's just illustrating some overfitting to my corpus...

I can't see any reason why that would cause this - it's the same corpus
giving a 0.32 result, after all. I'm thinking instead that, as per the above,
the optimal robinson_x almost certainly _isn't_ a simple average of the
p-values - especially not of p-values computed using the robinson
equation in the first place, including ones that have fewer than 10 or so
points of data each. Something to work on at some point...

       -Allen

-- 
Allen Smith			http://cesario.rutgers.edu/easmith/
September 11, 2001		A Day That Shall Live In Infamy II
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." - Benjamin Franklin

------- End of Forwarded Message