[Spambayes] Central limit

Tim Peters tim.one@comcast.net
Tue, 01 Oct 2002 01:25:28 -0400


[Tim]
>> and then I went on to choose different cutoffs depending on n (n is
>> the number of "extreme words" found in the msg, with a maximum of 50).

[Anthony Baxter]
> What happens past 50?

I don't know.  Gary originally suggested 30, and the only reason I tried 50
this time was due to a braino (I was editing the 150 max_discriminators
value we use now, and unthinkingly just deleted the "1").  I have no results
for any value other than 50.
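
For anyone following along, "extreme words" here means the tokens whose
spam probabilities lie farthest from the neutral 0.5, and
max_discriminators caps how many of them a single message contributes.  A
minimal sketch of that selection (not the classifier's actual code):

def extreme_words(word_probs, max_discriminators=50):
    # word_probs is an iterable of (token, spamprob) pairs for one msg.
    # Keep the tokens whose spamprob is farthest from the neutral 0.5.
    ranked = sorted(word_probs,
                    key=lambda pair: abs(pair[1] - 0.5),
                    reverse=True)
    return ranked[:max_discriminators]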

> extracting just the ones where it was "dead wrong"...

By this I guess you mean the error cases where the ratio exceeded 1.8?

>> 21: [448] {0.45%} 1.09 2.54
>> 23: [535] {0.56%} 1.38 1.72 2.20
>> 25: [638] {0.63%} 1.04 1.55 1.64 1.85
>> 29: [811] {0.49%} 1.03 1.18 1.41 2.24
>> 30: [763] {0.39%} 1.04 1.04 2.08

> What's the plot of cutoff -vs- uncertain messages like?
> How do these relate?

Sorry, I don't know what you mean.  Here's a histogram showing the # of
predictions made at each ratio, where the "99.0" bucket includes all ratios
>= 99.0 (there are a lot of those!):

90000 items; mean 96.06; sdev 2174.03
* = 141 items
  1.0  794 ******
  2.0 1411 ***********
  3.0 2067 ***************
  4.0 2373 *****************
  5.0 2640 *******************
  6.0 2708 ********************
  7.0 2883 *********************
  8.0 2747 ********************
  9.0 2598 *******************
 10.0 2478 ******************
 11.0 2307 *****************
 12.0 2194 ****************
 13.0 2008 ***************
 14.0 1906 **************
 15.0 2403 ******************
 16.0 5814 ******************************************
 17.0 5650 *****************************************
 18.0 3635 **************************
 19.0 2133 ****************
 20.0 1762 *************
 21.0 1634 ************
 22.0 1351 **********
 23.0 1154 *********
 24.0 1001 ********
 25.0  937 *******
 26.0  871 *******
 27.0  898 *******
 28.0  861 *******
 29.0  949 *******
 30.0  952 *******
 31.0  937 *******
 32.0  869 *******
 33.0  812 ******
 34.0  715 ******
 35.0  745 ******
 36.0  736 ******
 37.0  576 *****
 38.0  573 *****
 39.0  551 ****
 40.0  520 ****
 41.0  463 ****
 42.0  486 ****
 43.0  445 ****
 44.0  451 ****
 45.0  374 ***
 46.0  349 ***
 47.0  365 ***
 48.0  365 ***
 49.0  288 ***
 50.0  319 ***
 51.0  299 ***
 52.0  276 **
 53.0  281 **
 54.0  273 **
 55.0  255 **
 56.0  246 **
 57.0  239 **
 58.0  213 **
 59.0  236 **
 60.0  211 **
 61.0  188 **
 62.0  205 **
 63.0  178 **
 64.0  164 **
 65.0  162 **
 66.0  190 **
 67.0  177 **
 68.0  174 **
 69.0  145 **
 70.0  175 **
 71.0  155 **
 72.0  168 **
 73.0  123 *
 74.0  140 *
 75.0  132 *
 76.0  130 *
 77.0  133 *
 78.0  121 *
 79.0  119 *
 80.0  122 *
 81.0  125 *
 82.0  124 *
 83.0   97 *
 84.0   96 *
 85.0  125 *
 86.0   99 *
 87.0   93 *
 88.0   94 *
 89.0  102 *
 90.0   99 *
 91.0  105 *
 92.0   88 *
 93.0   82 *
 94.0   95 *
 95.0   72 *
 96.0   72 *
 97.0   82 *
 98.0   82 *
 99.0 8580 *************************************************************

I suppose you can get a crude answer to whatever it is you're asking by
staring at that <wink>.  Here's the same thing restricted to ratios < 10.0:

20221 items; mean 61.62; sdev 23.19
* = 6 items
 1.00  93 ****************
 1.10  74 *************
 1.20  69 ************
 1.30  75 *************
 1.40  71 ************
 1.50  66 ***********
 1.60  69 ************
 1.70  90 ***************
 1.80  91 ****************
 1.90  96 ****************
 2.00  94 ****************
 2.10 119 ********************
 2.20 126 *********************
 2.30 146 *************************
 2.40 136 ***********************
 2.50 144 ************************
 2.60 134 ***********************
 2.70 168 ****************************
 2.80 167 ****************************
 2.90 177 ******************************
 3.00 192 ********************************
 3.10 176 ******************************
 3.20 222 *************************************
 3.30 203 **********************************
 3.40 198 *********************************
 3.50 230 ***************************************
 3.60 205 ***********************************
 3.70 183 *******************************
 3.80 209 ***********************************
 3.90 249 ******************************************
 4.00 207 ***********************************
 4.10 253 *******************************************
 4.20 204 **********************************
 4.30 212 ************************************
 4.40 253 *******************************************
 4.50 240 ****************************************
 4.60 249 ******************************************
 4.70 246 *****************************************
 4.80 270 *********************************************
 4.90 239 ****************************************
 5.00 258 *******************************************
 5.10 240 ****************************************
 5.20 242 *****************************************
 5.30 256 *******************************************
 5.40 248 ******************************************
 5.50 279 ***********************************************
 5.60 263 ********************************************
 5.70 294 *************************************************
 5.80 286 ************************************************
 5.90 274 **********************************************
 6.00 259 ********************************************
 6.10 261 ********************************************
 6.20 257 *******************************************
 6.30 278 ***********************************************
 6.40 278 ***********************************************
 6.50 241 *****************************************
 6.60 279 ***********************************************
 6.70 287 ************************************************
 6.80 287 ************************************************
 6.90 281 ***********************************************
 7.00 299 **************************************************
 7.10 291 *************************************************
 7.20 311 ****************************************************
 7.30 285 ************************************************
 7.40 281 ***********************************************
 7.50 259 ********************************************
 7.60 292 *************************************************
 7.70 288 ************************************************
 7.80 285 ************************************************
 7.90 292 *************************************************
 8.00 249 ******************************************
 8.10 271 **********************************************
 8.20 261 ********************************************
 8.30 289 *************************************************
 8.40 269 *********************************************
 8.50 275 **********************************************
 8.60 294 *************************************************
 8.70 290 *************************************************
 8.80 281 ***********************************************
 8.90 268 *********************************************
 9.00 258 *******************************************
 9.10 263 ********************************************
 9.20 268 *********************************************
 9.30 280 ***********************************************
 9.40 279 ***********************************************
 9.50 247 ******************************************
 9.60 253 *******************************************
 9.70 244 *****************************************
 9.80 265 *********************************************
 9.90 241 *****************************************
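
In case anyone wants to re-slice these, here's a minimal sketch of the
bucketing both tables use.  It isn't the project's actual histogram code;
the catch-all last bucket and the "* = N items" scaling are just read off
the output above, and the mean/sdev lines aren't reproduced:

import math

def print_ratio_histogram(ratios, lo=1.0, width=1.0, nbuckets=99,
                          max_stars=61):
    # Count how many ratios land in each bucket; anything past the last
    # bucket edge folds into the final catch-all bucket (the "99.0" row
    # in the first table).
    counts = [0] * nbuckets
    for r in ratios:
        i = min(int((r - lo) / width), nbuckets - 1)
        counts[i] += 1

    # Scale so the tallest bar fits within max_stars characters.
    per_star = max(1, math.ceil(max(counts) / max_stars))
    print("* = %d items" % per_star)
    for i, count in enumerate(counts):
        print("%5.1f %4d %s" % (lo + i * width, count,
                                "*" * (count // per_star)))

With the defaults that gives the shape of the first table over all 90,000
ratios; width=0.1 and nbuckets=90 over the ratios below 10.0 gives the
second.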

>> 2. There are notable exceptions to that, but error rates are so low
>>    that a single message makes a large difference in error rate.

> Is there anything "magic" about those 5 fns? Were they the usual
> suspects? Does inspecting them by hand give any clues about other
> tokenisation clues that might have helped them? (e.g. if your corpus
> was sufficiently single-sourced that you could turn on all the
> disabled clue-extractors...)

Sorry, I can't relate the errors to msgs.  All I have is a binary pickle
containing 90,000 of these:

class Node(object):
    __slots__ = 'is_spam', 'n', 'zham', 'zspam', 'delta', 'score'

That was generated when I was testing a different "certainty heuristic",
one that performed much worse than the one I'm talking about now.  Its
text output file doesn't contain any error cases with ratios larger than
about 1.1, so it doesn't contain the errors in question now.  That
heuristic never made a mistake, but it considered huge numbers of msgs to
be uncertain -- if 25% of msgs are kicked out for manual review, I'd
consider the scheme wholly impractical.
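
For concreteness, here's roughly what the ratio-based certainty test looks
like, reconstructed from the zham/zspam fields in the Node above.  Take it
as a sketch rather than gospel (as noted at the end of this msg, there's
no code checked in for it):

def certainty_ratio(node):
    # The prediction goes to the population whose |zscore| is smaller;
    # the "ratio" histogrammed above is the larger |z| over the smaller,
    # so it's always >= 1.0.
    zham, zspam = abs(node.zham), abs(node.zspam)
    smaller, larger = min(zham, zspam), max(zham, zspam)
    return larger / smaller if smaller else float("inf")

def is_certain(node, ratio_cutoff=1.8):
    # Kick the msg out for manual review unless the ratio clears the
    # cutoff (errors with ratio > 1.8 are the "dead wrong" cases above).
    return certainty_ratio(node) >= ratio_cutoff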

>> 4. 5 predictions (of 90,000) were wrong with a ratio greater than 1.8.

> And all of those were fn, not fp.

That's right, in this particular test.  OTOH, this test ran 90 times,
each time training on 500+500 and then predicting against 4500+4500, so it
was giving itself a hard job.  I've got lots of reasons to believe that
training on 500 ham and 500 spam isn't enough to get reasonable coverage
of the diversity in my corpora.

Offline, Guido tried the use_central_limit2 code exactly as-is on a much
larger test, training on about 8K ham + 3K spam for each run.  I don't
recommend doing that because the "scores" produced by the code as-is make no
sense -- they basically produce 1 bit of information (which zscore was
smaller?) in a highly confusing way, and a way that's not symmetric around
0.5.  I believe he also used max_discriminators=150 (the default these
days), which may well be "too large" for the log-central-limit code (Gary
designed it to make extreme use of the extreme words, and there's no message
that has 150 distinct extreme words).

Even so, compared to our current default scheme, his bottom lines across 90
runs were:

total unique fp went from 904 to 324 won    -64.16%
mean fp % went from 0.662958214428 to 0.232509170721 won    -64.93%

total unique fn went from 97 to 275 lost  +183.51%
mean fn % went from 0.127271524421 to 0.328802849112 lost  +158.35%

and we've already seen that this scheme is less certain about spam than
about ham.  Alas, there's no way to know what the "certainty heuristic"
would have said in Guido's large run (there's no code checked in for that,
and I'm having an increasingly hard time making insane amounts of time for
this project).