[Spambayes] FYI: Java implementation

Tue Jan 21 11:12:00 EST 2003

[Anthony Baxter]
> It's also worth noting that the optimal cutoff values before chi-combining
> varied between 0.5 something and 0.7 for some people. It was impossible to
> pick a number that worked for everyone.

Ah, memories <wink>.

> (yes, I do plan to re-do the plots off the same data set at some point,
> and add some for the CLM combiners... - if someone wants to do it first
> and save me the effort, it would be faaaabulous)

Assuming CLM refers to the three central-limit combining schemes, they never
got far enough to develop a rational notion of "score".  They were the first
schemes that "knew when they were confused", and that caught us by surprise:
the initial stabs at getting "a score" out of them were like
Graham-combining in that they were sometimes extremely certain of a wrong
answer.  It took a while to realize that, when this happened, an internal
(for example) spam score was 50 sdevs on the spam side of the ham mean,
simultaneous with the internal ham score being 40 sdevs on the ham side of
the spam mean.  The overall result was extreme certainty that the thing was
spam, although the internal scores were certain it was neither.  Once we
figured that out, testing proceeded by producing one of exactly three
scores:  "it's ham", "it's spam", "I'm lost".  That's as far as they got, at
which point chi-combining appeared, also knowing when it was lost, but far
less problematic for training, and producing a "smooth" score naturally.
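
For illustration only, here is a minimal Python sketch of that three-way
decision.  It is not the code the project used: the function name, the
idea of passing the two distances in directly, and the max_z cutoff of 3
sdevs are all assumptions made up for this example.  The point it shows is
that a message far from both population means should come out "I'm lost"
rather than as a confident verdict either way.

```python
def three_way_verdict(z_from_ham, z_from_spam, max_z=3.0):
    """Return one of exactly three answers for a message.

    z_from_ham  -- distance, in sdevs, of the message from the ham mean
    z_from_spam -- distance, in sdevs, of the message from the spam mean
    max_z       -- hypothetical cutoff for "plausibly in this population"
    """
    looks_ham = abs(z_from_ham) <= max_z
    looks_spam = abs(z_from_spam) <= max_z
    if looks_ham and not looks_spam:
        return "it's ham"
    if looks_spam and not looks_ham:
        return "it's spam"
    # Either it fits neither population (the 50-sdev / 40-sdev case)
    # or it fits both; in both situations the honest answer is confusion.
    return "I'm lost"


# The troublesome case described above: far from ham AND far from spam.
print(three_way_verdict(50.0, 40.0))
# Unremarkable cases: close to one mean, far from the other.
print(three_way_verdict(0.5, 12.0))
print(three_way_verdict(15.0, 1.0))
```

Comparing the two distances against each other (50 beats 40, so "spam")
is what produces the misplaced certainty; checking each distance against
its own population first is what lets the scheme notice it is confused.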

A CLM plot would consist of three vertical lines, and so be a bit confusing
<wink>.