[Spambayes] RE: Central Limit Theorem??!! :)

Tim Peters tim.one@comcast.net
Fri, 27 Sep 2002 00:43:17 -0400


[Gary Robinson]
> ...
> I wonder if the errors for lcm are mostly in the region where there
> are a small number of data points, such that the central limit theorem
> isn't really kicking in.
>
> That is, it may be that there is some number n for which f(w)
> gives the best results if the number of non-middling words is < n
> and lcm gives the best results if the number of non-middling words
> is > n.
>
> That WOULD make a lot of theoretical sense, because for small
> enough n, the central limit theorem is meaningless and can only make
> trouble for us.
>
> Something else for you to test in your copious free time. ;)

This may be very cool!  I speculated with Guido about another possibility
for errors in this approach, and dumping in some instrumentation appears to
confirm it.

First, yes, some errors are due to very low n -- like it only finds 8 words
in an entire msg.  Those are hard to score for any scheme.  But these cases
usually *also* suffer the same problem I'll eventually get around to
revealing <wink>.

Second, here are some *typical* internal z-scores while predicting ham, all
with at least 30 non-middling words (this is just a slice I took from the
output, while it was predicting against 6 known ham):

 zham           zspam
 2.29985206263  -76.3424101961
 0.187039535126 -60.6685540929
 0.16058734364  -43.5223790527
 0.303545599809 -64.5043366748
 2.32811619768  -80.3108808262
 2.08243355217  -56.6967511599

Now if something is 60 spam sdevs away from the spam mean, and 1 ham sdev
away from the ham mean, extreme confidence is surely justified.  While
predicting a spam, it's never so extreme, because the population ham
variance is much larger than the population spam variance, so the z-scores
away from the ham mean simply can't get as large (btw, I believe this is why
it has such a pronounced tendency to err on the false negative side); even
so, extreme confidence is still justified with numbers like these (a typical
slice when predicting against 6 known spam):

 zham           zspam
 -26.1507326771 -0.680077213248
 -28.3253589669  1.10297422272
 -28.3253589669  1.10297422272
 -28.9332374355  1.31350047503
 -26.5203302612 -0.236968008101
 -37.2333822428 -0.722498689497
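
To make the arithmetic concrete, here's a sketch of how z-scores like these
fall out of the central limit theorem.  It is not the actual code:  it
assumes the per-message statistic is the mean log of the extreme-word
probabilities, and the function name and arguments are made up for
illustration:

    import math

    def zscores(word_probs, hammean, hamvar, spammean, spamvar):
        # word_probs: spam probabilities of the msg's non-middling
        # (extreme) words.  The per-message statistic is their mean
        # log; by the central limit theorem that's approximately
        # normal with standard error population_sdev / sqrt(n).
        n = len(word_probs)
        sample = sum(math.log(p) for p in word_probs) / n
        zham = (sample - hammean) / math.sqrt(hamvar / n)
        zspam = (sample - spammean) / math.sqrt(spamvar / n)
        return zham, zspam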

That brings us to the mistakes, and this seems a *typical* mistake when a
false negative pops up (there are plenty of words here; that's not the
problem):

 zham            zspam
 -17.8741370033  -20.4279646914

The best guess it can make is that it's closer to ham, but let's get real
about this <wink>:  these are honest-to-God probabilities (well, directly
related to honest-to-God probabilities), and at 18 sdevs away from the ham
mean, the system is screaming there's not a chance in hell the msg fits what
it knows about ham.  It's *also* screaming there's not a chance in hell the
msg fits what it knows about spam.

The only rational conclusion to draw is that the system is utterly baffled,
*and knows it*, so should kick such a msg out for manual review.  Not only
"a middle ground", but a principled middle ground where the system itself
knows it has no confidence in its decision, because both outcomes are
astronomically unlikely based on all it knows.
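
In code, that decision rule could be as simple as this sketch (the 10-sdev
cutoff is a placeholder, not a tuned value, and the names are made up):

    # Placeholder cutoff; a real deployment would have to tune it.
    MAX_ABS_Z = 10.0

    def judge(zham, zspam, max_abs_z=MAX_ABS_Z):
        # If the msg is wildly improbable under *both* populations,
        # refuse to guess and kick it out for manual review.
        if abs(zham) > max_abs_z and abs(zspam) > max_abs_z:
            return "unsure"
        return "ham" if abs(zham) < abs(zspam) else "spam"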

The messages that fall into this class *are* unusual, too!  I'm still
staring at two from the last run trying to decide whether they're really ham
or spam!  A third is one we debated on this list, and it took a google
search for related msgs to decide it was really spam.  That's extremely
cool, if the pattern holds:  nothing else has been so certain about its
uncertainty, and nothing else has pinpointed msgs I'm also uncertain about.

I can't make more time to pursue this now, but it's very exciting.

Another thing that *may* be cool:  the central limit approaches are a bitch
to train over time, because new messages change probabilities, and
probabilities change extremes.  That means whenever you add a message,
"in theory" you should go back over all the ham and spam you've ever
trained on and grab what may be new extremes from them (in order to compute
new population means and vars).
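
Concretely, the "in theory" retraining pass amounts to this sketch, where
score_msg is a made-up stand-in for whatever computes a message's
extreme-word statistic:

    def retrain_population_stats(ham_msgs, spam_msgs, score_msg):
        # After adding any message, every trained message's extreme-word
        # statistic may change, so recompute them all and rebuild the
        # population means and variances from scratch.
        def mean_var(samples):
            n = len(samples)
            mean = sum(samples) / n
            return mean, sum((s - mean) ** 2 for s in samples) / n

        hammean, hamvar = mean_var([score_msg(m) for m in ham_msgs])
        spammean, spamvar = mean_var([score_msg(m) for m in spam_msgs])
        return hammean, hamvar, spammean, spamvar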

But here are the ham and spam population means and vars from two runs on
disjoint random subsets of 2000 ham and 1400 spam (this is the logarithmic
version):

Run 1
hammean  -0.324091120632 hamvar  0.555156484853
spammean -0.104654537681 spamvar 0.121601025545

Run 2
hammean  -0.321945924099 hamvar  0.546761402392
spammean -0.105809267575 spamvar 0.124020283754

They're much the same across runs:  the means and variances differ by at
most about 2% across these disjoint training sets.  This, combined with the
extreme imbalance in z-scores during "typical" predictions, suggests that it
may be possible to do this *part* of training only once -- if the population
means and variances here have any sort of objective meaning <wink>, they're
simply not going to change much provided they were trained on lots of data
to begin with.