[Spambayes] z-combining
Rob Hooft
rob@hooft.net
Tue, 15 Oct 2002 15:19:57 +0200
Tim Peters wrote:
> If Rob is feeling particularly adventurous, it would be interesting (in
> connection with z-combining) to transform the database spamprobs into
> unit-normalized zscores via his RMS black magic, as an extra step at the end
> of update_probabilities(). This wouldn't require another pass over the
> training data, would speed z-combining scoring a lot, and I *think* would
> make the inputs to this scheme much closer to what Gary would really like
> them to be (z-combining *pretends* the "extreme-word" spamprobs are normally
> distributed now; I don't have any idea how close that is to the truth).
I'm not exactly sure what you want me to renormalize using my black
magic, but I did make an interesting histogram of 250000 single-token
spam probabilities. I hope you're not assuming that these are normally
distributed, although that looks like what you are doing when
recalculating them into Z-scores. Of the 250k tokens I put in my
histogram, 93k occurred exactly once, in the ham corpus of 4500 messages
only, and ~75k exactly once, in the spam corpus of 4500 messages only.
The noise you see at the baseline comes from tokens that occur multiple
times in both ham and spam; it is amplified in the second image, where
all words that occur only once or twice are removed from the histogram.
A histogram of words that occur more than 30 times in total is somewhat
flatter, but still has many >30+0 / 0+>30 extremes.
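To show why tokens seen only once pile up in narrow spikes, here is a
minimal sketch of such a histogram. It assumes a Robinson-style smoothed
probability with prior strength s=0.45 and prior x=0.5; those values, the
function names and the toy counts are my assumptions for illustration,
not taken from the classifier source.

```python
from collections import Counter


def spamprob(hamcount, spamcount, nham=4500, nspam=4500, s=0.45, x=0.5):
    """Smoothed spam probability for one token (assumed formula).

    hamcount/spamcount: number of ham/spam messages containing the token.
    s and x are an assumed prior strength and prior probability.
    The token must occur at least once in total.
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    return (s * x + n * p) / (s + n)


def histogram(probs, nbins=20):
    """Count probabilities per equal-width bin on [0, 1]."""
    counts = Counter(min(int(p * nbins), nbins - 1) for p in probs)
    return [counts.get(i, 0) for i in range(nbins)]


# Hypothetical (ham, spam) counts: mostly tokens seen once in one corpus.
toy_counts = [(1, 0)] * 9 + [(0, 1)] * 7 + [(40, 0), (0, 35), (3, 2)]
probs = [spamprob(h, sp) for h, sp in toy_counts]
for i, c in enumerate(histogram(probs)):
    print(f"{i / 20:.2f}-{(i + 1) / 20:.2f} {'#' * c}")
```

Under these assumptions every token seen once in ham only lands at about
0.155 and every token seen once in spam only at about 0.845, so the bulk
of the histogram is two spikes rather than anything bell-shaped.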
My strongest ham clue is "wrote:" (763+0, counted as ham+spam), followed
by "het" (533+0) [Dutch for "the" with neuter nouns, and for "it"]; on
the spam side the strongest are "8bit%:100" (0+937) and
"charset:ks_c_5601-1987" (0+838).
> The
> attraction of this scheme is that it gives a single "spam probability"
> directly; combining distinct ham and spam indicators is still a bit of a
> puzzle (although a happy puzzle from my POV when both indicators suck, as
> happens in chi combining with large numbers of strong clues on both ends).
I don't see why this scheme could not produce an "H" value as well, and
then mix it with the "S" score we're using now. This scheme looks a lot
like the "S" half of earlier ones like chi2 combining. Think about what
goes wrong if we used only the S half of chi2 combining: messages
that look like both ham and spam come out as perfect spam, and messages
that look like neither ham nor spam come out as perfect ham.
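A minimal sketch of that failure mode, assuming chi2 combining takes S
from the product of (1-p), H from the product of p, and reports
(S - H + 1) / 2. The function names and the two toy clue lists are
illustrative, and the underflow handling a real implementation would need
is left out.

```python
import math


def chi2Q(x2, v):
    """Return P(chi-squared >= x2) for an even number v of degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def s_and_h(probs):
    """Return (S, H) for clue probabilities strictly between 0 and 1."""
    n = len(probs)
    s = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return s, h


def combined(probs):
    """Mix both halves into one score in [0, 1]."""
    s, h = s_and_h(probs)
    return (s - h + 1.0) / 2.0


# Hypothetical messages: strong clues on both ends, and no strong clues.
both = [0.01] * 25 + [0.99] * 25
neither = [0.5] * 50

for name, probs in (("both", both), ("neither", neither)):
    s, h = s_and_h(probs)
    print(f"{name}: S={s:.4f} H={h:.4f} combined={combined(probs):.4f}")
```

With S alone, the "both" message scores close to 1 and the "neither"
message close to 0; mixing in H pulls both back to about 0.5, which is
the "unsure" answer one would want for them.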
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/
Attachments (non-text, scrubbed from the archive):
Name: probfreq.png
Type: image/png
Size: 9903 bytes
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/probfreq.png
Name: prob2.png
Type: image/png
Size: 7985 bytes
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/prob2.png
Name: balk29.png
Type: image/png
Size: 5900 bytes
Url : http://mail.python.org/pipermail-21/spambayes/attachments/20021015/87672913/balk29.png