[Spambayes] spamprob combining

Rob Hooft rob@hooft.net
Thu, 10 Oct 2002 07:00:04 +0200


Tim Peters wrote:
> "Tim combining" simply takes the geometric mean of the spamprobs as a
> measure of spamminess S, and the geometric mean of 1-spamprob as a measure
> of hamminess H, then returns S/(S+H) as "the score".  This is well-behaved
> when fed random, uniformly distributed probabilities, but isn't reluctant to
> let an overwhelming number of extreme clues lead it to an extreme conclusion
> (although you're not going to see it give Graham-like 1e-30 or
> 1.0000000000000 scores).
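
For concreteness, the combining rule quoted above could be sketched in Python roughly as follows. This is only an illustration built from that description, not the actual Spambayes code: the function name tim_combine is made up here, and the sketch assumes every spamprob lies strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def tim_combine(probs):
    """Score a message as S/(S+H) from per-word spam probabilities.

    Hypothetical helper: assumes 0 < p < 1 for every p in probs.
    """
    n = len(probs)
    # Geometric means are computed in log space so that long clue
    # lists do not underflow to zero.
    s = math.exp(sum(math.log(p) for p in probs) / n)        # spamminess S
    h = math.exp(sum(math.log(1.0 - p) for p in probs) / n)  # hamminess H
    return s / (s + h)

if __name__ == "__main__":
    print(tim_combine([0.5, 0.5, 0.5]))
    print(tim_combine([0.99] * 50))
```

Because the geometric mean divides out the number of clues, fifty clues at 0.99 give the same score as one clue at 0.99, which is why no Graham-like 1e-30 scores show up.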

While reading this I had a sudden thought: With the distributions I'm 
normally interested in, I want to explain the "bulk" accurately, without 
being extremely sensitive to the tails. For example, in my previous job, the 
bulk was a database of protein structures, and I wanted to describe the 
bulk so that I could recognize the outliers. In my current job, the 
population is pixel activity on a CCD, and I don't want to be sensitive 
to bad pixels.

The standard way to calculate a standard deviation is to calculate the 
mean <x> first, and then, in a second pass over the numbers, take the 
square root of sum((x-<x>)^2)/(n-1). This is rather sensitive to 
outliers, however. In both cases I have experience with, the best way to 
describe the bulk is to use the median, and "median ways" to calculate 
the standard deviation. These methods almost completely ignore the 
extreme values.
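
To make the contrast concrete, here is a small Python sketch. The "median way" used here is an assumption on my part: the median absolute deviation (MAD), scaled by 1.4826 so that it matches the standard deviation for normally distributed data. The function names and the sample data are invented for the illustration.

```python
import statistics

def classic_stats(xs):
    """Mean and sample standard deviation (the n-1 form)."""
    return statistics.mean(xs), statistics.stdev(xs)

def robust_stats(xs):
    """Median and a MAD-based estimate of the standard deviation."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    # 1.4826 makes the MAD comparable to sigma for Gaussian data.
    return med, 1.4826 * mad

if __name__ == "__main__":
    # Seven well-behaved values and one "bad pixel".
    data = [10, 11, 9, 10, 12, 8, 10, 1000]
    print("classic:", classic_stats(data))
    print("robust: ", robust_stats(data))
```

On this data the single outlier drags the classic mean and standard deviation far away from the bulk, while the median-based pair stays close to 10 and 1.5.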

But now spambayes. The bulk are words like "the" and "with" and "want", 
and so on: all totally uninteresting. So if we want to be sensitive to 
outliers, we should "go the other way". We have two options I can think of:
  * use a (x-<x>)^4 function. This will be very sensitive to extremes.
  * calculate the mean and standard deviation both using the standard
    technique and using medians, and then use the DIFFERENCE between the
    results as a measure of the extreme-characteristic.
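
Both options can be tried on toy numbers with a short Python sketch. Everything here is illustrative: the function names are made up, the MAD scaled by 1.4826 stands in for the median-based standard deviation (an assumption, since there are other median-based estimators), and the sample values are invented rather than taken from any real message.

```python
import statistics

def fourth_moment(xs):
    """Option 1: mean of (x - <x>)^4, dominated by the extreme values."""
    m = statistics.mean(xs)
    return sum((x - m) ** 4 for x in xs) / len(xs)

def extremeness(xs):
    """Option 2: classic minus median-based estimates.

    Returns (mean - median, stdev - scaled MAD); both grow when a few
    extreme values sit on top of an uninteresting bulk.
    """
    med = statistics.median(xs)
    robust_sd = 1.4826 * statistics.median([abs(x - med) for x in xs])
    return statistics.mean(xs) - med, statistics.stdev(xs) - robust_sd

if __name__ == "__main__":
    bulk = [0.45, 0.5, 0.55] * 10            # bland values only
    with_clues = bulk + [0.99, 0.99, 0.01]   # plus three extreme values
    for name, xs in (("bulk", bulk), ("with clues", with_clues)):
        print(name, fourth_moment(xs), extremeness(xs))
```

Both measures come out larger for the second list than for the first, which is the direction wanted here; how to turn either one into a score is left open.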

Just some random ideas I wouldn't yet know how to apply.

Rob
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/