[Spambayes] RE: Central Limit Theorem??!! :)
Tim Peters
tim.one@comcast.net
Mon, 23 Sep 2002 13:36:37 -0400
[Gary Robinson]
> ...
> Still seems better than p(w), which is assuming s is 0!
It approaches that, but not quite: p(w) is still bounded by 0.01 and 0.99.
In small training sets, something I've seen several times is that the
training data ends up with an original message or a reply to it, and the
prediction data contains the other half, and sometimes there's a word
common to that pair but otherwise unique across the entire corpus. p(w)
assigns that word prob 0.01 (well, our p(w) does; Graham's gives it the
"unknown word prob" due to "not appearing often enough"), and that really
helps the prediction. Of course it works out in the other direction too --
for example, a spam containing the misspelling "propsal" (for "proposal")
became a false negative (fn) just because somebody once made the same misspelling in my ham.
Overall it seems much more *likely* that unique typos are shared between a
message and a reply, though, and "message + reply containing some quote" are
virtually always both ham. OTOH, spam seems to contain common wild
misspellings too.
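The tradeoff above can be made concrete with a small sketch of Robinson's smoothing (assuming his f(w) = (s*x + n*p(w))/(s + n), with s the prior strength, x the background expectation, and n the number of messages containing the word; the function name is just illustrative). As s shrinks toward 0, f(w) collapses to the raw, clamped p(w), so a word seen once gets an extreme score:

```python
def robinson_f(p, n, s=1.0, x=0.5):
    # Robinson's smoothed word probability: a weighted blend of the
    # background expectation x and the observed ratio p, where n is the
    # number of training messages containing the word and s is the
    # strength given to the prior x.
    return (s * x + n * p) / (s + n)

# A word seen once, only in ham: the raw p(w) is clamped to 0.01.
raw = 0.01
for s in (1.0, 0.1, 0.0):
    print(s, robinson_f(raw, n=1, s=s))
```

With s=1 the lone sighting is pulled most of the way back toward 0.5; with s=0 it scores exactly 0.01, which is the "assuming s is 0" case Gary describes.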
> That's just another decision for a particular constant s... there's no
> getting away from the issue.
Indeed there isn't.