[Spambayes] z-combining

Rob Hooft rob@hooft.net
Tue, 15 Oct 2002 15:19:57 +0200


Tim Peters wrote:

> If Rob is feeling particularly adventurous, it would be interesting (in
> connection with z-combining) to transform the database spamprobs into
> unit-normalized zscores via his RMS black magic, as an extra step at the end
> of update_probabilities().  This wouldn't require another pass over the
> training data, would speed z-combining scoring a lot, and I *think* would
> make the inputs to this scheme much closer to what Gary would really like
> them to be (z-combining *pretends* the "extreme-word" spamprobs are normally
> distributed now; I don't have any idea how close that is to the truth). 
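
To make the normality question concrete, here is a rough sketch of what
turning spamprobs into z-scores and combining them could look like. It
uses the textbook inverse-normal construction, not the RMS-based
normalization mentioned in the quote, and it is not taken from the
Spambayes code. The function names and the eps clamp are assumptions
made for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
_STD_NORMAL = NormalDist()


def spamprob_to_z(p, eps=1e-9):
    """Map a spamprob in (0, 1) to a z-score via the inverse normal CDF.

    The clamp (an assumption) keeps p away from 0 and 1, where the
    inverse CDF is undefined.
    """
    p = min(max(p, eps), 1.0 - eps)
    return _STD_NORMAL.inv_cdf(p)


def z_combine(spamprobs):
    """Combine per-token spamprobs into one score.

    If the individual z-scores really were independent unit normals,
    their sum divided by sqrt(n) would be unit normal as well; that is
    exactly the assumption being questioned in this thread.
    """
    if not spamprobs:
        return 0.5
    zsum = sum(spamprob_to_z(p) for p in spamprobs)
    return _STD_NORMAL.cdf(zsum / sqrt(len(spamprobs)))
```

A token distribution with large spikes near 0 and 1 would produce
z-scores that are far from unit normal, which is why the shape of the
histogram matters here.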

I'm not exactly sure what you want me to renormalize using my black
magic, but I did make an interesting histogram of 250000 single-token
spam probabilities. I hope you are not assuming that these are
normally distributed, although that seems to be what recalculating
them into Z-scores does. Of the 250k tokens in my histogram, 93k
occurred exactly once, and only in the ham corpus of 4500 messages;
~75k occurred exactly once, and only in the spam corpus of 4500
messages. The noise you see at the baseline comes from tokens that
occur multiple times in both ham and spam; it is amplified in the
second image, where all words that occur only once or twice are
removed from the histogram. A histogram of words that occur more than
30 times in total is somewhat flatter, but still has many >30+0 /
0+>30 extremes.
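
For anyone who wants to reproduce this kind of histogram, a minimal
sketch follows. It assumes the token database can be read as a mapping
from token to (ham count, spam count), and it uses a Robinson-style
adjusted probability with s=0.45 and x=0.5; those parameter values, the
function names and the two "hapax" tokens are assumptions for
illustration, not necessarily what produced the attached images. The
min_total argument corresponds to dropping words that occur only once
or twice.

```python
from collections import Counter


def spamprob(ham_count, spam_count, nham, nspam, s=0.45, x=0.5):
    """Adjusted single-token spam probability (s and x are assumed)."""
    n = ham_count + spam_count
    if n == 0:
        return x  # never-seen token: fall back to the prior
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * raw) / (s + n)


def histogram(token_counts, nham, nspam, nbins=20, min_total=1):
    """Count tokens per spamprob bin, skipping tokens seen < min_total times."""
    bins = Counter()
    for ham_count, spam_count in token_counts.values():
        if ham_count + spam_count < min_total:
            continue
        p = spamprob(ham_count, spam_count, nham, nspam)
        bins[min(int(p * nbins), nbins - 1)] += 1
    return [bins[i] for i in range(nbins)]


# Four counts quoted in this message plus two hypothetical one-off tokens.
counts = {
    "wrote:": (763, 0),
    "het": (533, 0),
    "8bit%:100": (0, 937),
    "charset:ks_c_5601-1987": (0, 838),
    "hapax_ham": (1, 0),    # hypothetical token seen once, in ham only
    "hapax_spam": (0, 1),   # hypothetical token seen once, in spam only
}
all_tokens = histogram(counts, nham=4500, nspam=4500)
frequent_only = histogram(counts, nham=4500, nspam=4500, min_total=3)
```

With these assumed parameters, one-off tokens land in two fixed bins
away from the edges, while frequent one-sided tokens pile up in the
first and last bins.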

My strongest ham clue is "wrote:" (763+0), followed by "het" (533+0)
[Dutch for "the" with neuter nouns, and also for "it"]; on the spam
side they are "8bit%:100" (0+937) and "charset:ks_c_5601-1987" (0+838).

> The
> attraction of this scheme is that it gives a single "spam probability"
> directly; combining distinct ham and spam indicators is still a bit of a
> puzzle (although a happy puzzle from my POV when both indicators suck, as
> happens in chi combining with large numbers of strong clues on both ends).

I don't see why this scheme could not produce an "H" value as well, and
then mix it with the "S" score we're using now. This scheme looks a lot
like the "S" half of earlier ones such as chi2 combining. Think about
what goes wrong if we used only the S half of chi2 combining: messages
that look like both ham and spam come out as perfect spam, and messages
that look like neither ham nor spam come out as perfect ham.
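
The failure mode can be checked numerically with a small sketch of chi2
combining. It assumes S and H are each one minus a chi-squared tail
probability over the clue spamprobs, and that they are mixed as
(S - H + 1) / 2; the helper names and the two synthetic messages are
made up for illustration, and all spamprobs must lie strictly between
0 and 1.

```python
import math


def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v only."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi2_halves(spamprobs):
    """Return (S, H): the spam-side and ham-side indicators."""
    n = len(spamprobs)
    ln_s = sum(math.log(1.0 - p) for p in spamprobs)
    ln_h = sum(math.log(p) for p in spamprobs)
    S = 1.0 - chi2Q(-2.0 * ln_s, 2 * n)
    H = 1.0 - chi2Q(-2.0 * ln_h, 2 * n)
    return S, H


def combined(spamprobs):
    """Mix both halves into a single score in [0, 1]."""
    S, H = chi2_halves(spamprobs)
    return (S - H + 1.0) / 2.0


# Synthetic messages (assumed clue values, not real data).
both = [0.99] * 25 + [0.01] * 25   # strong clues on both ends
neither = [0.5] * 50               # no informative clues at all

S_both, H_both = chi2_halves(both)
S_neither, H_neither = chi2_halves(neither)
```

Under these assumptions S_both is very close to 1 and S_neither is
close to 0, so S alone would call the first message spam and the second
ham, while combined() puts both near 0.5.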

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: probfreq.png
Type: image/png
Size: 9903 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/probfreq.png

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: prob2.png
Type: image/png
Size: 7985 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/prob2.png

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: balk29.png
Type: image/png
Size: 5900 bytes
Desc: not available
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/balk29.png

---------------------- multipart/mixed attachment--