[Spambayes] fwd: robinson f(w) equation: X constant confusion
Justin Mason
jm@jmason.org
Thu Nov 28 22:25:17 2002
Tim Peters said:
> Rob Hooft ran some downhill Simplex optimizations that also converged on X a
> bit over 0.5, and S substantially smaller than we use by default (we use
> 0.45 by default).
> On three different sets of test data, I measured "the average" spamprob to
> be a bit over 0.5 too (it ranged from 0.52 to 0.56).
Interesting!
> A difference is that the test data I used had about the same number of ham
> as spam, while you've got a 1::2 ratio. Are you sure you weren't using 1000
> spam vs 2000 ham? If you were, and "the true unknown word" spamprob were
> about 0.5, I'd expect you to measure one near 1/3, since there would be (to
> a 0th-order approximation <wink>) about twice as many ham-word spamprobs
> feeding into the computed average than there were spam-word spamprobs
> feeding into it, and that would drag the average below 0.5 simply due to
> having more of one kind of word than the other.
Actually, I've just checked -- it's not 2k:1k, it's 2k:2k, so it should
be even.
> IOW, Gary's suggestion for guessing x appears to me to be sensitive to the
> ham::spam ratio, but the method used for guessing spamprobs tries (with
> mixed results) not to be sensitive to that ratio. Mismatching assumptions,
> then.
Interesting, BTW. Do you guys use the estimated X instead of a constant?
Sounds like it could vary greatly depending on corpus ratios...
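For context, the f(w) equation in the subject line is Gary Robinson's smoothed word probability, where x is the background guess for a never-seen word and s is the strength given to that guess. A minimal sketch in Python (my own naming, not the spambayes source):

```python
def robinson_fw(p, n, x=0.5, s=0.45):
    """Robinson's adjusted word probability f(w).

    p: raw spam probability of the word
    n: number of training messages containing the word
    x: assumed probability for a word with no data
    s: strength (weight) given to that background assumption
    """
    return (s * x + n * p) / (s + n)

print(robinson_fw(p=0.0, n=0))    # -> 0.5 (no data: pure prior x)
print(robinson_fw(p=0.99, n=100)) # close to 0.99 (data dominates)
```

With n=0 the estimate is exactly x; as n grows, the raw probability p takes over, which is why the choice of x mostly matters for rare words.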
> The X=0.53 S=0.05 result is cute -- it roughly says "it's about 50-50, but
> don't pay much attention to it".
There's another "sweet spot" at X=.69 and S=.42, which mystifies me;
I would have thought that would cause more FPs, which are the worst
case for the cost measure (see below).
> I'm not sure what your cost measure is; as
> we measure costs by default, an FP is charged 10, in which case the contour
> lines ranging from 80 to 90 are showing the difference between one FP more
> or less; this *can* make them supremely sensitive to just one or two oddball
> msgs.
The cost measure is a direct copy of the spambayes one, so they can
be compared ;) (I also use TCR, the cost measure used by Ion
Androutsopoulos' papers; but being able to see "unsures" helps
us pick a good scheme which maps well into SpamAssassin scores.)
BTW, an interesting point is that those scores were measured using
a high "min prob strength" cutoff; I used 0.27. I'm running more tests
where this varies, and I think that'll be quite interesting too ;)
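For anyone following along, a "min prob strength" cutoff just discards words whose spamprob sits too close to 0.5 to carry evidence either way; a sketch of that interpretation:

```python
def strong_words(spamprobs, min_strength=0.27):
    """Keep only words whose spamprob deviates from the neutral 0.5
    by at least min_strength; the rest are too weak to score on."""
    return {w: p for w, p in spamprobs.items()
            if abs(p - 0.5) >= min_strength}

probs = {"viagra": 0.98, "meeting": 0.45, "free": 0.80, "the": 0.51}
print(strong_words(probs))  # only "viagra" and "free" survive
```

A higher cutoff means fewer, stronger clues feed the combiner, which interacts with the X and S choices above.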
PS: while I'm here -- I'm also comparing chi2 with gary-combining. I'm
finding chi2 to have quite a few more FPs in particular, right in the 0.00
spike. Do you guys see much of this? Or have I screwed up my code with
all this constant-tweaking? ;)
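For context, the chi-squared combining mentioned here runs the word probabilities through the chi-square survival function in both the spammy and hammy directions and averages the two pieces of evidence, so a message the words say nothing about lands near 0.5 ("unsure"). A self-contained sketch of that scheme as I understand it (not the spambayes source itself):

```python
import math

def chi2q(x2, v):
    """P(X >= x2) for a chi-square variable with v degrees of
    freedom, v even -- the standard series for this special case."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi2_spamprob(probs):
    """Combine per-word spamprobs into one score in [0, 1].
    s measures spammy evidence, h hammy evidence."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0

print(chi2_spamprob([0.99] * 10))  # near 1.0: strongly spammy
print(chi2_spamprob([0.5] * 10))   # 0.5: no evidence either way
```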
--j.