[Spambayes] RE: Central Limit Theorem??!! :)

Gary Robinson grobinson@transpose.com
Mon, 23 Sep 2002 10:06:47 -0400


Error:

> When training on the spam side, don't use f(w), use: ln (1-f(w)).
> 
> When training on the ham side, don't use f(w), use: ln f(w).

is backwards. Do exactly the reverse:

When training on the spam side, don't use f(w), use: ln f(w).

When training on the ham side, don't use f(w), use: ln (1-f(w)).

Also, those numbers will be negative. If you want positive numbers, use -(ln
f(w)) and -(ln (1-f(w))). The minus sign will make no difference whatsoever
to the results.
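In code, the corrected transform might look like this (a minimal Python sketch; `fw` stands for a single word's f(w) value in (0, 1), and none of this is actual spambayes code):

```python
import math

def spam_term(fw):
    # spam side: train and score on ln f(w)
    return math.log(fw)

def ham_term(fw):
    # ham side: train and score on ln (1 - f(w))
    return math.log(1.0 - fw)

def sample_mean(terms):
    # arithmetic mean of the transformed terms, fed to the z-score calcs
    return sum(terms) / len(terms)
```

Negating both terms, as noted above, just flips the sign of every mean and changes nothing downstream.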




--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


> From: Gary Robinson <grobinson@transpose.com>
> Date: Mon, 23 Sep 2002 08:46:10 -0400
> To: Tim Peters <tim.one@comcast.net>, SpamBayes <spambayes@python.org>, Greg
> Louis <glouis@dynamicro.on.ca>
> Subject: Re: [Spambayes] RE: Central Limit Theorem??!!     :)
> 
> OK. That's fascinating.
> 
> Remember that the multiplicative method in S, which calculates the geometric
> mean of the f(w)'s, stresses the MOST extreme values more than the less
> extreme ones. The more extreme the value is, the more it is stressed. Very
> extreme ones are stressed very highly, in an exponentially compounding way if
> there are several really extreme ones.
> 
> That's why it's the basis for that 1971 optimality theorem that I kept trying
> to invoke more strongly, but which is still invoked to a degree in S.
> 
> That's the reasoning behind S in the first place, and why it works so well,
> and it also happens in Graham's original approach (but only on one side), but
> we are completely losing it in R.
> 
> In R, we are trading that very powerful multiplicative effect away in order to
> get the benefit of real parametric statistics. ALSO a very powerful technique
> but apparently slightly less powerful in this application -- at least when
> used alone.
> 
> If there is a performance loss in R (and there are no remaining coding
> errors), I am confident that that's why.
> 
> THERE IS A POTENTIAL FIX FOR THIS LOSSAGE, so that we can theoretically get
> the best of both totally different techniques.
> 
> When training on the spam side, don't use f(w), use: ln (1-f(w)).
> 
> When training on the ham side, don't use f(w), use: ln f(w).
> 
> Same when testing. Don't add the f(w)'s in creating the sample mean; add the
> expressions above, and divide by n. So the spam side uses ln (1-f(w)) both for
> training and calculating the sample means, and the ham side uses ln f(w) for
> both.
> 
> As we've discussed, averaging the ln's is the same thing as a geometric mean
> if you then subsequently raise e to the power of that computed average. But we
> don't do that last step here. We just feed the arithmetic mean of the ln's
> into the z-score calcs.
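The ln/geometric-mean equivalence Gary describes here is easy to check numerically (illustrative Python with made-up f(w) values, not spambayes code):

```python
import math

fws = [0.2, 0.7, 0.9, 0.95]  # hypothetical f(w) values

# arithmetic mean of the logs -- this is what gets fed to the z-score calcs
mean_ln = sum(math.log(f) for f in fws) / len(fws)

# raising e to that mean recovers the geometric mean exactly,
# but that last step is deliberately skipped in the scheme above
geo_mean = math.prod(fws) ** (1.0 / len(fws))
assert abs(math.exp(mean_ln) - geo_mean) < 1e-12
```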
> 
> This *should* bring us the benefits of the multiplicative approach and of the
> parametric stats approach at the same time.
> 
> The downside of this is that the ln's make such a skewed distribution that it
> should take a bigger n to make the central limit theorem kick in. BUT, OTOH,
> it WILL kick in, and something like 30 may really still be enough (it usually
> kicks in at significantly smaller numbers). And you've also successfully used
> n=150 with f(w) and that is DEFINITELY enough.
> 
> The other downside is that it just seems like a bit of a wild thing to do and
> sometimes when you do wild things, strange reasons emerge why they won't work.
> But I really can't see any at this point as long as n is big enough that the
> sample means take on a normal distribution.
> 
> THANKS for doing all the coding work to test this idea!!!!  :)
> 
> Gary
> 
> 
> 
> 
> 
>> 
>> This was using 30 extremes, and using Graham's p(w) (complete with hambias
>> 2, minprob .01 and maxprob .99).  The f-n rate was more than 10x worse using
>> f(w) with a=x=0.5, and I have no idea why yet (and we're *generally* having
>> problems with f-n rates on smaller training sets when using f(w), whether
>> using the central-limit scoring, or Gary's previous scoring; perhaps 'a'
>> needs to be much smaller than 0.5 -- there's too much to test here).
>> 
>> Here are the aggregate scaled R values (clamped to [-20, 20], and then
>> scaled linearly into [0, 1]):
>> 
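The clamp-and-scale step Tim describes can be sketched as follows (illustrative Python, not the actual test-harness code):

```python
def scaled_r(r, lo=-20.0, hi=20.0):
    # clamp R to [lo, hi], then map it linearly onto [0, 1]
    r = max(lo, min(hi, r))
    return (r - lo) / (hi - lo)
```

So R = -20 maps to 0.0, R = 0 to 0.5, and anything at or above 20 saturates at 1.0.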
>> Ham distribution for all runs:
>> 5000 items; mean 0.35; sample sdev 3.62
>> * = 82 items
>> 0.00 4907 ************************************************************
>> 2.50   13 *
>> 5.00   14 *
>> 7.50   12 *
>> 10.00    4 *
>> 12.50   11 *
>> 15.00    6 *
>> 17.50    7 *
>> 20.00    4 *
>> 22.50    2 *
>> 25.00    2 *
>> 27.50    3 *
>> 30.00    0
>> 32.50    1 *
>> 35.00    3 *
>> 37.50    0
>> 40.00    1 *
>> 42.50    1 *
>> 45.00    1 *
>> 47.50    1 *
>> 50.00    1 *
>> 52.50    0
>> 55.00    1 *
>> 57.50    0
>> 60.00    0
>> 62.50    1 *
>> 65.00    2 *
>> 67.50    0
>> 70.00    0
>> 72.50    0
>> 75.00    0
>> 77.50    0
>> 80.00    0
>> 82.50    0
>> 85.00    0
>> 87.50    0
>> 90.00    0
>> 92.50    0
>> 95.00    0
>> 97.50    2 *
>> 
>> Spam distribution for all runs:
>> 5000 items; mean 98.97; sample sdev 5.87
>> * = 80 items
>> 0.00    0
>> 2.50    0
>> 5.00    0
>> 7.50    0
>> 10.00    0
>> 12.50    0
>> 15.00    0
>> 17.50    0
>> 20.00    1 *
>> 22.50    0
>> 25.00    0
>> 27.50    3 *
>> 30.00    1 *
>> 32.50    1 *
>> 35.00    1 *
>> 37.50    1 *
>> 40.00    1 *
>> 42.50    0
>> 45.00    1 *
>> 47.50    3 *
>> 50.00    6 *
>> 52.50    5 *
>> 55.00    5 *
>> 57.50    3 *
>> 60.00    7 *
>> 62.50    6 *
>> 65.00    9 *
>> 67.50   11 *
>> 70.00   19 *
>> 72.50   11 *
>> 75.00    8 *
>> 77.50   10 *
>> 80.00   13 *
>> 82.50    9 *
>> 85.00   11 *
>> 87.50   16 *
>> 90.00   31 *
>> 92.50    8 *
>> 95.00   15 *
>> 97.50 4784 ************************************************************
>> 
>> All of the false positives had significant numbers of both 0.01 and 0.99
>> clues.  This seems to be a reappearance of the p(w) "cancellation disease"
>> that we wormed around before by adding gobs of special-case code to Graham
>> scoring.  Several of the fns also had this problem.  The outcome is like
>> flipping a coin when this happens.  Note that f(w) doesn't have this problem
>> (it's mostly an artifact of the fact that p(w) artificially clamps probabilities, and
>> so many words end up with probs at the extreme values).
>> 
>> Here are the means and variances of the training data scaled R values:
>> 
>> hammean  0.0315194110198 hamvar  0.0102392908745
>> spammean 0.977596060549  spamvar 0.00629493144389
>> 
>> hammean  0.0289322128628 hamvar  0.00860263576484
>> spammean 0.976784463455  spamvar 0.00703754635535
>> 
>> hammean  0.0292168061706 hamvar  0.00850922282341
>> spammean 0.977330456163  spamvar 0.00656386045426
>> 
>> hammean  0.0292418489626 hamvar  0.00869327431102
>> spammean 0.972324957985  spamvar 0.00968199258783
>> 
>> hammean  0.0266295579103 hamvar  0.00745682458391
>> spammean 0.974432833096  spamvar 0.00865812701944
>> 
>> 
>> Finally, here's the same thing (including exactly the same messages in the
>> training and prediction sets) all over again, *except* using f(w) with a=0.1
>> and x=0.5 (I mentioned a=0.5 above; I lowered it again for this run, and
>> that did help the f-n rate, but not much):
>> 
>>     0.000   3.700
>> 0 new false positives
>> 37 new false negatives
>> 
>>     0.000   2.500
>> 0 new false positives
>> 25 new false negatives
>> 
>>     0.000   4.800
>> 0 new false positives
>> 48 new false negatives
>> 
>>     0.100   2.900
>> 1 new false positives
>> 29 new false negatives
>> 
>>     0.100   3.400
>> 1 new false positives
>> 34 new false negatives
>> 
>> total unique false pos 2
>> total unique false neg 173
>> average fp % 0.04
>> average fn % 3.46
>> 
>> Ham distribution for all runs:
>> 5000 items; mean 0.05; sample sdev 1.57
>> * = 84 items
>> 0.00 4991 ************************************************************
>> 2.50    2 *
>> 5.00    0
>> 7.50    1 *
>> 10.00    0
>> 12.50    0
>> 15.00    0
>> 17.50    0
>> 20.00    1 *
>> 22.50    1 *
>> 25.00    1 *
>> 27.50    1 *
>> 30.00    0
>> 32.50    0
>> 35.00    0
>> 37.50    0
>> 40.00    0
>> 42.50    0
>> 45.00    0
>> 47.50    0
>> 50.00    0
>> 52.50    0
>> 55.00    0
>> 57.50    1 *
>> 60.00    0
>> 62.50    0
>> 65.00    0
>> 67.50    0
>> 70.00    0
>> 72.50    0
>> 75.00    0
>> 77.50    1 *
>> 80.00    0
>> 82.50    0
>> 85.00    0
>> 87.50    0
>> 90.00    0
>> 92.50    0
>> 95.00    0
>> 97.50    0
>> 
>> Spam distribution for all runs:
>> 5000 items; mean 94.82; sample sdev 15.16
>> * = 69 items
>> 0.00    5 *
>> 2.50    2 *
>> 5.00    1 *
>> 7.50    5 *
>> 10.00    8 *
>> 12.50    6 *
>> 15.00   10 *
>> 17.50    8 *
>> 20.00    6 *
>> 22.50   14 *
>> 25.00   10 *
>> 27.50   11 *
>> 30.00    9 *
>> 32.50    4 *
>> 35.00   12 *
>> 37.50   12 *
>> 40.00   11 *
>> 42.50    8 *
>> 45.00   16 *
>> 47.50   15 *
>> 50.00   21 *
>> 52.50   21 *
>> 55.00   20 *
>> 57.50   27 *
>> 60.00   21 *
>> 62.50   19 *
>> 65.00   31 *
>> 67.50   18 *
>> 70.00   18 *
>> 72.50   22 *
>> 75.00   22 *
>> 77.50   37 *
>> 80.00   36 *
>> 82.50   50 *
>> 85.00   59 *
>> 87.50   50 *
>> 90.00   67 *
>> 92.50   74 **
>> 95.00   76 **
>> 97.50 4138 ************************************************************
>> 
>> Too bizarre for me -- there may be a gross bug here, but the central-limit
>> code is exactly the same in both cases, and the f(w) code is exactly the
>> same as I've been using with good results for a few days.
>> 
>> hammean  0.083575863227  hamvar 0.0384359627039
>> spammean 0.978952388668 spamvar 0.00372443106446
>> 
>> hammean  0.075515986459  hamvar 0.0331126798862
>> spammean 0.97742506528  spamvar 0.00420869219347
>> 
>> hammean  0.0776481612081 hamvar 0.0338362239317
>> spammean 0.978462136207 spamvar 0.00341505065601
>> 
>> hammean  0.07972071882   hamvar 0.0355833143776
>> spammean 0.974508870296 spamvar 0.00530574730638
>> 
>> hammean  0.0713015705881 hamvar 0.0303987810665
>> spammean 0.976734052416 spamvar 0.00468795224229
>> 
>> For whatever reason(s), I note that the ham variances are higher here than
>> when using p(w), and the spam variances lower.  Perhaps that's just due to
>> the fact that f(w) doesn't have an artificial ham bias.  OTOH, the prediction set ham
>> distribution is much tighter when using the unbiased f(w), while the spam
>> distribution is much looser.
>>