[Spambayes] RE: Central Limit Theorem??!! :)
Gary Robinson
grobinson@transpose.com
Mon, 23 Sep 2002 10:06:47 -0400
Error:
> When training on the spam side, don't use f(w), use: ln (1-f(w)).
>
> When training on the ham side, don't use f(w), use: ln f(w).
is backwards. Do exactly the reverse:
When training on the spam side, don't use f(w), use: ln f(w).
When training on the ham side, don't use f(w), use: ln (1-f(w)).
Also, those numbers will be negative. If you want positive numbers, use -(ln
f(w)) and -(ln (1-f(w))). The minus sign will make no difference whatsoever
to the results.
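
For concreteness, here is a minimal sketch of the corrected transform in Python
(the function names, and the requirement that 0 < f(w) < 1 so the logs are
defined, are my own framing, just for illustration):

    import math

    def spam_term(fw):
        # spam-side training/testing value for a word probability f(w),
        # assuming 0 < f(w) < 1 so the log is defined
        return math.log(fw)

    def ham_term(fw):
        # ham-side training/testing value for the same f(w)
        return math.log(1.0 - fw)

    # Negating both (i.e. using -math.log(...)) flips the sign of every term
    # identically, so it cannot change the outcome.
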
--Gary
--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
> From: Gary Robinson <grobinson@transpose.com>
> Date: Mon, 23 Sep 2002 08:46:10 -0400
> To: Tim Peters <tim.one@comcast.net>, SpamBayes <spambayes@python.org>, Greg
> Louis <glouis@dynamicro.on.ca>
> Subject: Re: [Spambayes] RE: Central Limit Theorem??!! :)
>
> OK. That's fascinating.
>
> Remember that the multiplicative method in S, which calculates the geometric
> mean of the f(w)'s, stresses the MOST extreme values more than the less
> extreme ones. The more extreme the value is, the more it is stressed. Very
> extreme ones are stressed very highly, in an exponentially compounding way if
> there are several really extreme ones.
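>
> As a tiny illustration with made-up f(w) values:
>
>     import math
>     fws = [0.5, 0.5, 0.5, 0.01]    # one very extreme value among bland ones
>     arith = sum(fws) / len(fws)    # 0.3775 -- barely pulled down
>     geom = math.exp(sum(math.log(f) for f in fws) / len(fws))
>     print(geom)                    # ~0.188 -- the single extreme value dominates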
>
> That's why it's the basis for that 1971 optimality theorem that I kept trying
> to invoke more strongly, but which is still invoked to a degree in S.
>
> That's the reasoning behind S in the first place, and why it works so well;
> the same effect also shows up in Graham's original approach (but only on one
> side), but we are completely losing it in R.
>
> In R, we are trading that very powerful multiplicative effect away in order to
> get the benefit of real parametric statistics. That is ALSO a very powerful
> technique, but apparently slightly less powerful in this application -- at
> least when used alone.
>
> If there is a performance loss in R (and there are no remaining coding
> errors), I am confident that that's why.
>
> THERE IS A POTENTIAL FIX FOR THIS LOSSAGE, so that we can theoretically get
> the best of both totally different techniques.
>
> When training on the spam side, don't use f(w), use: ln (1-f(w)).
>
> When training on the ham side, don't use f(w), use: ln f(w).
>
> Same when testing. Don't add the f(w)'s in creating the sample mean; add the
> expressions above, and divide by n. So the spam side uses ln (1-f(w)) both for
> training and calculating the sample means, and the ham side uses ln f(w) for
> both.
>
> As we've discussed, averaging the ln's is the same thing as a geometric mean
> if you then raise e to the power of that computed average. But we
> don't do that last step here. We just feed the arithmetic mean of the ln's
> into the z-score calcs.
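>
> A quick sanity check of that equivalence, with made-up numbers (mu, sigma and
> the variable names are placeholders, not the real training statistics):
>
>     import math
>
>     fws = [0.9, 0.8, 0.99, 0.7]    # made-up f(w) values for one message
>     n = len(fws)
>     log_mean = sum(math.log(f) for f in fws) / n
>     # exponentiating recovers the geometric mean ...
>     assert abs(math.exp(log_mean) - (0.9 * 0.8 * 0.99 * 0.7) ** 0.25) < 1e-12
>     # ... but it is log_mean itself that feeds the z-score, e.g.
>     mu, sigma = -0.2, 0.1          # placeholder population mean / sdev of the ln's
>     z = (log_mean - mu) / (sigma / math.sqrt(n))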
>
> This *should* bring us the benefits of the multiplicative approach and of the
> parametric stats approach at the same time.
>
> The downside of this is that the ln's make such a skewed distribution that it
> should take a bigger n to make the central limit theorem kick in. BUT, OTOH,
> it WILL kick in, and something like 30 may really still be enough (it usually
> kicks in at significantly smaller numbers). And you've also successfully used
> n=150 with f(w) and that is DEFINITELY enough.
>
> The other downside is that it just seems like a bit of a wild thing to do, and
> sometimes when you do wild things, strange reasons emerge for why they won't work.
> But I really can't see any at this point as long as n is big enough that the
> sample means take on a normal distribution.
>
> THANKS for doing all the coding work to test this idea!!!! :)
>
> Gary
>
>
>
>
>
>>
>> This was using 30 extremes, and using Graham's p(w) (complete with hambias
>> 2, minprob .01 and maxprob .99). The f-n rate was more than 10x worse using
>> f(w) with a=x=0.5, and I have no idea why yet (and we're *generally* having
>> problems with f-n rates on smaller training sets when using f(w), whether
>> using the central-limit scoring, or Gary's previous scoring; perhaps 'a'
>> needs to be much smaller than 0.5 -- there's too much to test here).
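>>
>> For reference, f(w) here is Gary's smoothed word probability. If I have the
>> form right, it is roughly the sketch below, but treat the exact shape and the
>> parameter names as my assumption rather than the code's:
>>
>>     def f_w(p_w, n, a=0.5, x=0.5):
>>         # a acts as the strength of the prior belief x; n is the number of
>>         # training messages containing the word; p_w is Graham-style p(w).
>>         return (a * x + n * p_w) / (a + n)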
>>
>> Here are the aggregate scaled R values (clamped to [-20, 20], and then
>> scaled linearly into [0, 1]):
>>
>> Ham distribution for all runs:
>> 5000 items; mean 0.35; sample sdev 3.62
>> * = 82 items
>> 0.00 4907 ************************************************************
>> 2.50 13 *
>> 5.00 14 *
>> 7.50 12 *
>> 10.00 4 *
>> 12.50 11 *
>> 15.00 6 *
>> 17.50 7 *
>> 20.00 4 *
>> 22.50 2 *
>> 25.00 2 *
>> 27.50 3 *
>> 30.00 0
>> 32.50 1 *
>> 35.00 3 *
>> 37.50 0
>> 40.00 1 *
>> 42.50 1 *
>> 45.00 1 *
>> 47.50 1 *
>> 50.00 1 *
>> 52.50 0
>> 55.00 1 *
>> 57.50 0
>> 60.00 0
>> 62.50 1 *
>> 65.00 2 *
>> 67.50 0
>> 70.00 0
>> 72.50 0
>> 75.00 0
>> 77.50 0
>> 80.00 0
>> 82.50 0
>> 85.00 0
>> 87.50 0
>> 90.00 0
>> 92.50 0
>> 95.00 0
>> 97.50 2 *
>>
>> Spam distribution for all runs:
>> 5000 items; mean 98.97; sample sdev 5.87
>> * = 80 items
>> 0.00 0
>> 2.50 0
>> 5.00 0
>> 7.50 0
>> 10.00 0
>> 12.50 0
>> 15.00 0
>> 17.50 0
>> 20.00 1 *
>> 22.50 0
>> 25.00 0
>> 27.50 3 *
>> 30.00 1 *
>> 32.50 1 *
>> 35.00 1 *
>> 37.50 1 *
>> 40.00 1 *
>> 42.50 0
>> 45.00 1 *
>> 47.50 3 *
>> 50.00 6 *
>> 52.50 5 *
>> 55.00 5 *
>> 57.50 3 *
>> 60.00 7 *
>> 62.50 6 *
>> 65.00 9 *
>> 67.50 11 *
>> 70.00 19 *
>> 72.50 11 *
>> 75.00 8 *
>> 77.50 10 *
>> 80.00 13 *
>> 82.50 9 *
>> 85.00 11 *
>> 87.50 16 *
>> 90.00 31 *
>> 92.50 8 *
>> 95.00 15 *
>> 97.50 4784 ************************************************************
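>>
>> A rough sketch of the clamp-and-scale used to produce the R values in these
>> histograms (the helper name is made up):
>>
>>     def scaled_r(r, lo=-20.0, hi=20.0):
>>         # clamp the raw R value into [lo, hi], then map it linearly onto [0, 1];
>>         # the histograms above appear to display this on a 0-100 scale
>>         r = max(lo, min(hi, r))
>>         return (r - lo) / (hi - lo)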
>>
>> All of the false positives had significant numbers of both 0.01 and 0.99
>> clues. This seems to be a reappearance of the p(w) "cancellation disease"
>> that we wormed around before by adding gobs of special-case code to Graham
>> scoring. Several of the fns also had this problem. The outcome is like
>> flipping a coin when this happens. Note that f(w) doesn't have this problem
>> (it's mostly an artifact of the fact that p(w) artificially clamps probabilities,
>> so that many words end up with probs at the extreme values).
>>
>> Here are the means and variances of the training data scaled R values:
>>
>> hammean 0.0315194110198 hamvar 0.0102392908745
>> spammean 0.977596060549 spamvar 0.00629493144389
>>
>> hammean 0.0289322128628 hamvar 0.00860263576484
>> spammean 0.976784463455 spamvar 0.00703754635535
>>
>> hammean 0.0292168061706 hamvar 0.00850922282341
>> spammean 0.977330456163 spamvar 0.00656386045426
>>
>> hammean 0.0292418489626 hamvar 0.00869327431102
>> spammean 0.972324957985 spamvar 0.00968199258783
>>
>> hammean 0.0266295579103 hamvar 0.00745682458391
>> spammean 0.974432833096 spamvar 0.00865812701944
>>
>>
>> Finally, here's the same thing (including exactly the same messages in the
>> training and prediction sets) all over again, *except* using f(w) with a=0.1
>> and x=0.5 (I mentioned a=0.5 above; I lowered it again for this run, and
>> that did help the f-n rate, but not much):
>>
>> 0.000 3.700
>> 0 new false positives
>> 37 new false negatives
>>
>> 0.000 2.500
>> 0 new false positives
>> 25 new false negatives
>>
>> 0.000 4.800
>> 0 new false positives
>> 48 new false negatives
>>
>> 0.100 2.900
>> 1 new false positives
>> 29 new false negatives
>>
>> 0.100 3.400
>> 1 new false positives
>> 34 new false negatives
>>
>> total unique false pos 2
>> total unique false neg 173
>> average fp % 0.04
>> average fn % 3.46
>>
>> Ham distribution for all runs:
>> 5000 items; mean 0.05; sample sdev 1.57
>> * = 84 items
>> 0.00 4991 ************************************************************
>> 2.50 2 *
>> 5.00 0
>> 7.50 1 *
>> 10.00 0
>> 12.50 0
>> 15.00 0
>> 17.50 0
>> 20.00 1 *
>> 22.50 1 *
>> 25.00 1 *
>> 27.50 1 *
>> 30.00 0
>> 32.50 0
>> 35.00 0
>> 37.50 0
>> 40.00 0
>> 42.50 0
>> 45.00 0
>> 47.50 0
>> 50.00 0
>> 52.50 0
>> 55.00 0
>> 57.50 1 *
>> 60.00 0
>> 62.50 0
>> 65.00 0
>> 67.50 0
>> 70.00 0
>> 72.50 0
>> 75.00 0
>> 77.50 1 *
>> 80.00 0
>> 82.50 0
>> 85.00 0
>> 87.50 0
>> 90.00 0
>> 92.50 0
>> 95.00 0
>> 97.50 0
>>
>> Spam distribution for all runs:
>> 5000 items; mean 94.82; sample sdev 15.16
>> * = 69 items
>> 0.00 5 *
>> 2.50 2 *
>> 5.00 1 *
>> 7.50 5 *
>> 10.00 8 *
>> 12.50 6 *
>> 15.00 10 *
>> 17.50 8 *
>> 20.00 6 *
>> 22.50 14 *
>> 25.00 10 *
>> 27.50 11 *
>> 30.00 9 *
>> 32.50 4 *
>> 35.00 12 *
>> 37.50 12 *
>> 40.00 11 *
>> 42.50 8 *
>> 45.00 16 *
>> 47.50 15 *
>> 50.00 21 *
>> 52.50 21 *
>> 55.00 20 *
>> 57.50 27 *
>> 60.00 21 *
>> 62.50 19 *
>> 65.00 31 *
>> 67.50 18 *
>> 70.00 18 *
>> 72.50 22 *
>> 75.00 22 *
>> 77.50 37 *
>> 80.00 36 *
>> 82.50 50 *
>> 85.00 59 *
>> 87.50 50 *
>> 90.00 67 *
>> 92.50 74 **
>> 95.00 76 **
>> 97.50 4138 ************************************************************
>>
>> Too bizarre for me -- there may be a gross bug here, but the central-limit
>> code is exactly the same in both cases, and the f(w) code is exactly the
>> same as I've been using with good results for a few days.
>>
>> hammean 0.083575863227 hamvar 0.0384359627039
>> spammean 0.978952388668 spamvar 0.00372443106446
>>
>> hammean 0.075515986459 hamvar 0.0331126798862
>> spammean 0.97742506528 spamvar 0.00420869219347
>>
>> hammean 0.0776481612081 hamvar 0.0338362239317
>> spammean 0.978462136207 spamvar 0.00341505065601
>>
>> hammean 0.07972071882 hamvar 0.0355833143776
>> spammean 0.974508870296 spamvar 0.00530574730638
>>
>> hammean 0.0713015705881 hamvar 0.0303987810665
>> spammean 0.976734052416 spamvar 0.00468795224229
>>
>> For whatever reason(s), I note that the ham variances are higher here than
>> when using p(w), and the spam variances lower. Perhaps that's just because
>> f(w) doesn't have an artificial ham bias. OTOH, the prediction set ham
>> distribution is much tighter when using the unbiased f(w), while the spam
>> distribution is much looser.
>>