[Spambayes] RE: Central Limit Theorem??!! :)

Tim Peters tim.one@comcast.net
Mon, 23 Sep 2002 13:36:37 -0400


[Gary Robinson]
> ...
> Still seems better than p(w), which  is assuming s is 0!

It approaches that, but not quite:  p(w) is still bounded by 0.01 and 0.99.
In small training sets, something I've seen several times is that the
training data ends up with an original message or a reply to it, and the
prediction data contains the other half, and sometimes there's a word
common to that pair but otherwise unique across the entire corpus.  p(w)
assigns that word prob 0.01 (well, our p(w) does; Graham's gives it the
"unknown word prob" due to "not appearing often enough"), and that really
helps the prediction.  Of course it works out in the other direction too --
for example, a spam containing the misspelling "propsal" (for "proposal")
became a false negative just because somebody once made the same misspelling
in my ham.
Overall it seems much more *likely* that unique typos are shared between a
message and a reply, though, and "message + reply containing some quote" are
virtually always both ham.  OTOH, spam seems to contain common wild
misspellings too.
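The clamping mentioned above can be sketched as follows.  This is a
hypothetical illustration, not the project's actual tokenizer code: the
function names and count-based formula are assumptions, but the key point
matches the discussion -- a word seen only in spam (or only in ham) gets
pushed to the 0.99 (or 0.01) bound rather than to a raw 1.0 or 0.0.

```python
def clamped_prob(spam_count, ham_count, nspam, nham,
                 lo=0.01, hi=0.99):
    """Graham-style per-word spam probability, clamped to [lo, hi].

    spam_count/ham_count: messages in each corpus containing the word;
    nspam/nham: total messages trained in each corpus.  A word unique
    to one corpus gets probability hi (or lo) instead of 1.0 (or 0.0).
    Assumes the word was seen at least once (else division by zero).
    """
    spam_freq = spam_count / nspam   # relative frequency in spam
    ham_freq = ham_count / nham      # relative frequency in ham
    p = spam_freq / (spam_freq + ham_freq)
    return min(hi, max(lo, p))
```

So the shared-typo word in the training half of a message/reply pair gets
probability 0.01, which is what "really helps the prediction" here.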

> That's just another decision for a particular constant s... there's no
> getting away from the issue.

Indeed there isn't.
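For context, the s under discussion is the strength parameter in Robinson's
smoothed word probability, f(w) = (s*x + n*p(w)) / (s + n), where n is the
number of messages the word appeared in and x is the assumed prior
("unknown word") probability.  A minimal sketch, with the parameter
defaults chosen here purely for illustration:

```python
def robinson_f(p, n, s=1.0, x=0.5):
    """Robinson's smoothed word probability f(w).

    p: raw per-word probability p(w)
    n: number of training messages containing the word
    s: strength given to the prior x ("unknown word" probability)

    Setting s=0 collapses f(w) back to p(w) for any word that has
    been seen at all -- the "assuming s is 0" point above -- while
    a word never seen (n=0) gets exactly the prior x.
    """
    return (s * x + n * p) / (s + n)
```

Any particular choice of s (including 0) is still a choice, which is the
point being conceded.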