[Spambayes] fwd: robinson f(w) equation: X constant confusion
Justin Mason
jm@jmason.org
Thu Nov 28 22:25:17 2002
Tim Peters said:
> Rob Hooft ran some downhill Simplex optimizations that also converged on X a
> bit over 0.5, and S substantially smaller than we use by default (we use
> 0.45 by default).
> On three different sets of test data, I measured "the average" spamprob to
> be a bit over 0.5 too (it ranged from 0.52 to 0.56).
Interesting!
> A difference is that the test data I used had about the same number of ham
> as spam, while you've got a 1::2 ratio. Are you sure you weren't using 1000
> spam vs 2000 ham? If you were, and "the true unknown word" spamprob were
> about 0.5, I'd expect you to measure one near 1/3, since there would be (to
> a 0th-order approximation <wink>) about twice as many ham-word spamprobs
> feeding into the computed average than there were spam-word spamprobs
> feeding into it, and that would drag the average below 0.5 simply due to
> having more of one kind of word than the other.
Actually, I've just checked -- it's not 2k:1k, it's 2k:2k, so it should
be even.
> IOW, Gary's suggestion for guessing x appears to me to be sensitive to the
> ham::spam ratio, but the method used for guessing spamprobs tries (with
> mixed results) not to be sensitive to that ratio. Mismatching assumptions,
> then.
Interesting, BTW. Do you guys use the estimated X instead of a constant?
Sounds like it could vary greatly depending on corpus ratios...
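For context, the f(w) equation in the subject line is Gary Robinson's smoothed word probability, where x is the background guess for a never-seen word and s is the strength given to that guess. A minimal sketch in Python (my own naming, not the spambayes source):

```python
def robinson_fw(p, n, x=0.5, s=0.45):
    """Robinson's adjusted word probability f(w).

    p: raw spam probability of the word
    n: number of training messages containing the word
    x: assumed probability for a word with no data
    s: strength (weight) given to that background assumption
    """
    return (s * x + n * p) / (s + n)

print(robinson_fw(p=0.0, n=0))    # -> 0.5 (no data: pure prior x)
print(robinson_fw(p=0.99, n=100)) # close to 0.99 (data dominates)
```

With n=0 the estimate is exactly x; as n grows, the raw probability p takes over, which is why the choice of x mostly matters for rare words.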
> The X=0.53 S=0.05 result is cute -- it roughly says "it's about 50-50, but
> don't pay much attention to it".
There's another "sweet spot" at X=.69 and S=.42, which mystifies me;
I would have thought that would cause more FPs, which are the worst
case for the cost measure (see below).
> I'm not sure what your cost measure is; as
> we measure costs by default, an FP is charged 10, in which case the contour
> lines ranging from 80 to 90 are showing the difference between one FP more
> or less; this *can* make them supremely sensitive to just one or two oddball
> msgs.
The cost measure is a direct copy of the spambayes one, so they can
be compared ;) (I also use TCR, the cost measure used by Ion
Androutsopoulos' papers; but being able to see "unsures" helps
us pick a good scheme which maps well into SpamAssassin scores.)
BTW, an interesting point is that those scores were measured using
a high "min prob strength" cutoff; I used 0.27. I'm running more tests
where this varies, and I think that'll be quite interesting too ;)
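For anyone following along, a "min prob strength" cutoff just discards words whose spamprob sits too close to 0.5 to carry evidence either way; a sketch of that interpretation:

```python
def strong_words(spamprobs, min_strength=0.27):
    """Keep only words whose spamprob deviates from the neutral 0.5
    by at least min_strength; the rest are too weak to score on."""
    return {w: p for w, p in spamprobs.items()
            if abs(p - 0.5) >= min_strength}

probs = {"viagra": 0.98, "meeting": 0.45, "free": 0.80, "the": 0.51}
print(strong_words(probs))  # only "viagra" and "free" survive
```

A higher cutoff means fewer, stronger clues feed the combiner, which interacts with the X and S choices above.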
PS: while I'm here -- I'm also comparing chi2 with gary-combining. I'm
finding chi2 to have quite a few more FPs in particular, right in the 0.00
spike. Do you guys see much of this? Or have I screwed up my code with
all this constant-tweaking? ;)
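For context, the chi-squared combining mentioned here runs the word probabilities through the chi-square survival function in both the spammy and hammy directions and averages the two pieces of evidence, so a message the words say nothing about lands near 0.5 ("unsure"). A self-contained sketch of that scheme as I understand it (not the spambayes source itself):

```python
import math

def chi2q(x2, v):
    """P(X >= x2) for a chi-square variable with v degrees of
    freedom, v even -- the standard series for this special case."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi2_spamprob(probs):
    """Combine per-word spamprobs into one score in [0, 1].
    s measures spammy evidence, h hammy evidence."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0

print(chi2_spamprob([0.99] * 10))  # near 1.0: strongly spammy
print(chi2_spamprob([0.5] * 10))   # 0.5: no evidence either way
```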
--j.