[Spambayes] fwd: robinson f(w) equation: X constant confusion

Tim Peters tim_one@email.msn.com
Thu Nov 28 18:22:57 2002


[Justin Mason]
> just wondering about this.  I ran some tests which wandered across
> the landscape of X and S values (as used in Gary Robinson's f(w)
> equation), and computed a cost figure based on a corpus of 2000
> spam v. 1000 ham, then graphed it.
>
> Results are here:
>
>    http://spamassassin.taint.org/qa/s_x_gary.png
>
> note that X=0.53 S=0.05 and X=0.69 S=0.32 seem to
> give the best results.
>
> However, computing X, as per Gary's webpage, results in a value of 0.32.
> But according to that graph, 0.32 is pretty much crap ;)

Rob Hooft ran some downhill Simplex optimizations that also converged on X a
bit over 0.5, and on S substantially smaller than our default of 0.45.

On three different sets of test data, I measured "the average" spamprob to
be a bit over 0.5 too (it ranged from 0.52 to 0.56).

A difference is that the test data I used had about the same number of ham
as spam, while you've got a 1::2 ratio.  Are you sure you weren't using 1000
spam vs 2000 ham?  If you were, and "the true unknown word" spamprob were
about 0.5, I'd expect you to measure one near 1/3, since there would be (to
a 0th-order approximation <wink>) about twice as many ham-word spamprobs
feeding into the computed average as there were spam-word spamprobs
feeding into it, and that would drag the average below 0.5 simply due to
having more of one kind of word than the other.
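
A toy sketch of that 0th-order argument, in Python.  It assumes the crudest
possible model, which is my simplification and not anything measured: every
ham-only word has spamprob 0.0, every spam-only word has spamprob 1.0, and
the number of distinct words scales with the number of messages.  The
function name and the word counts are made up for illustration.

```python
def average_spamprob(n_ham_words, n_spam_words, ham_prob=0.0, spam_prob=1.0):
    """Plain mean of the per-word spamprobs under the toy model.

    ham_prob and spam_prob are the (assumed) spamprobs of words seen only
    in ham and only in spam, respectively.
    """
    total = n_ham_words * ham_prob + n_spam_words * spam_prob
    return total / (n_ham_words + n_spam_words)


# Balanced vocabulary: the average sits at 0.5.
balanced = average_spamprob(1000, 1000)

# Twice as many ham words as spam words: the average drops to about 1/3,
# even though nothing about an unknown word has changed.
ham_heavy = average_spamprob(2000, 1000)

print(balanced, ham_heavy)
```

Under this model the computed average tracks the mix of words in the
training data, which is the sensitivity described above.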

IOW, Gary's suggestion for guessing x appears to me to be sensitive to the
ham::spam ratio, but the method used for guessing spamprobs tries (with
mixed results) not to be sensitive to that ratio.  Mismatching assumptions,
then.
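
For reference, a minimal sketch of how x and s enter Gary's adjustment,
f(w) = (s*x + n*p(w)) / (s + n), where n is the number of messages the word
was seen in and p(w) is the raw spamprob estimate.  The function name is
hypothetical, and the default x=0.5 is an assumption on my part; s=0.45 is
the default mentioned above.

```python
def robinson_f(p, n, x=0.5, s=0.45):
    """Robinson's adjusted spamprob for a word.

    p: raw spamprob estimate for the word, p(w)
    n: number of training messages containing the word
    x: assumed spamprob of a never-seen word
    s: strength given to x, in units of "messages seen"
    """
    return (s * x + n * p) / (s + n)


# A word never seen before (n == 0) just gets x back.
unseen = robinson_f(0.9, 0, x=0.53, s=0.05)

# A word seen once, only in spam (p == 1.0): a small s barely pulls it
# toward x, while the default s pulls it noticeably harder.
weak_prior = robinson_f(1.0, 1, x=0.53, s=0.05)
default_prior = robinson_f(1.0, 1, x=0.5, s=0.45)

print(unseen, weak_prior, default_prior)
```

With s as small as 0.05, a single sighting of a word already outweighs x
almost completely, so the exact value of x matters little once a word has
been seen at all.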

The X=0.53 S=0.05 result is cute -- it roughly says "it's about 50-50, but
don't pay much attention to it".  I'm not sure what your cost measure is.  As
we measure costs by default, an FP is charged 10, in which case the contour
lines ranging from 80 to 90 are showing a difference of just one FP more or
less; this *can* make them supremely sensitive to just one or two oddball
msgs.
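
To make the cost arithmetic concrete, here is a small sketch.  The FP charge
of 10 is the one stated above; the FN charge of 1 and the error counts are
assumptions picked only to land on the 80 and 90 contours, and the function
name is hypothetical.

```python
def total_cost(n_fp, n_fn, fp_weight=10.0, fn_weight=1.0):
    """Weighted error cost: each FP costs fp_weight, each FN fn_weight.

    fn_weight=1.0 is an assumed value for illustration.
    """
    return n_fp * fp_weight + n_fn * fn_weight


# Two runs that differ by a single false positive (made-up counts).
cost_a = total_cost(n_fp=3, n_fn=50)
cost_b = total_cost(n_fp=4, n_fn=50)

# The gap between the two is exactly one FP's worth of cost.
print(cost_a, cost_b, cost_b - cost_a)
```

One misclassified ham is enough to move a parameter setting from one
contour to the next.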



