[spambayes-dev] The naive Bayes classifier algorithm in spambayes doesn't take frequency into account?

Austine Jane janeaustine50 at hotmail.com
Tue Aug 31 14:54:59 CEST 2004


Hello.

I have a question about the naive Bayes classifier algorithm used in spambayes.

I would suppose that if word1 appeared in three ham mails, the probability of 
word1 appearing in ham mail would be greater than if it had appeared in only 
one ham mail:

(using spambayes-1.0rc2)
>>> from spambayes import storage
>>> c = storage.DBDictClassifier('test.db')
>>> def tok(s): return s.split()
>>> c.learn(tok('word1'), is_spam=False)
>>> c.spamprob(tok('word1'))
0.15517241379310343
>>> c.learn(tok('word1'), is_spam=False)
>>> c.spamprob(tok('word1'))
0.091836734693877542
>>> c.learn(tok('word1'), is_spam=False)
>>> c.spamprob(tok('word1'))
0.065217391304347783

As you can see, the spam probability declines. So far so good.
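(For what it's worth, those three values look consistent with a per-word 
probability of the Robinson form sketched below. This is only my reading of 
the classifier, not spambayes code; word_prob is a name I made up, and the 
constants are assumed to be the default options unknown_word_strength 
s = 0.45 and unknown_word_prob x = 0.5.

def word_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    # fraction of the trained ham/spam *messages* that contained the word
    hamratio = hamcount / float(nham) if nham else 0.0
    spamratio = spamcount / float(nspam) if nspam else 0.0
    if hamratio + spamratio:
        prob = spamratio / (hamratio + spamratio)
    else:
        prob = x
    n = hamcount + spamcount
    # pull words with little evidence toward the neutral prior x
    return (s * x + n * prob) / (s + n)

With 1, 2 and 3 ham trainings and no spam, this gives 0.225/1.45, 0.225/2.45 
and 0.225/3.45, i.e. about 0.1552, 0.0918 and 0.0652, matching the session 
above; for a single-word message, spamprob appears to come out at the word's 
own probability.)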

>>> c.learn(tok('word1'), is_spam=True)

So word1 has now also appeared in one spam mail, in addition to the three 
ham mails it appeared in before.

>>> c.spamprob(tok('word1'))
0.5

Hmm... that doesn't seem right.

>>> c.learn(tok('word1'), is_spam=False)
>>> c.spamprob(tok('word1'))
0.5

It stays the same.
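(The sketched formula above would explain this: the ratios divide the number 
of messages containing word1 by the total number of messages in each class, 
so once word1 appears in every ham and every spam, both ratios are 1.0 no 
matter how many messages were trained:

>>> word_prob(1, 3, 1, 3)
0.5
>>> word_prob(1, 4, 1, 4)
0.5

Only the proportion of messages containing the word matters, and here it has 
saturated at 1.0 for both classes.)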

This doesn't seem intuitive. For example, suppose word1 occurred in 1000 spam 
mails and in 1 ham mail. What is the probability that a mail containing word1 
is spam? Half and half? Doesn't the classifier take the number of occurrences 
into account (it does seem to take the number of distinct tokens into 
account)? It looks as if the number of occurrences and the number of distinct 
tokens are conflated in spambayes' classifier.
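(Checking that scenario against the same sketch, assuming a corpus of exactly 
1000 spam and 1 ham with word1 in every message:

>>> word_prob(1000, 1, 1000, 1)
0.5

Both ratios are 1.0 once more, and x = 0.5 is a fixed point of the 
adjustment, so under that formula the answer really would be half and half.)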

Machine Learning by Tom Mitchell (esp. page 183, and 
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html; see also 
the p.s.) suggests a formula that gives quite different results from 
spambayes' classifier. It always takes the number of occurrences into 
account, and hence seems more intuitive.

Am I missing something big? Thanks in advance,

Jane

-------------
p.s.

The formula in Tom Mitchell's book is:

Vocabulary is the set of all distinct words and other tokens occurring in 
any text document from Examples.
For each target value v_j in V do
	* docs_j is the subset of documents from Examples for which the target 
value is v_j
	* P(v_j) = |docs_j| / |Examples|
	* Text_j is a single document created by concatenating all members of 
docs_j
	* n is the total number of distinct word positions in Text_j
	* for each word w_k in Vocabulary
	    * n_k is the number of times word w_k occurs in Text_j
	    * P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|)

As you can see, it uses an m-estimate, so the occurrence counts n_k always 
enter the probabilities; a quick sketch follows.
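Here is a minimal sketch of that procedure in Python. The names learn_nb_text 
and classify_nb_text are my own, and I assume documents arrive pre-tokenized 
as word lists:

import math
from collections import Counter

def learn_nb_text(examples):
    # examples: list of (list_of_words, target_value) pairs
    vocabulary = set(w for doc, _ in examples for w in doc)
    prior, cond = {}, {}
    for v in set(label for _, label in examples):
        docs_j = [doc for doc, label in examples if label == v]
        prior[v] = float(len(docs_j)) / len(examples)      # P(v_j)
        text_j = [w for doc in docs_j for w in doc]        # concatenate docs_j
        n = len(text_j)                                    # word positions in Text_j
        counts = Counter(text_j)                           # n_k for each w_k
        cond[v] = dict((w, (counts[w] + 1.0) / (n + len(vocabulary)))
                       for w in vocabulary)                # P(w_k|v_j)
    return prior, cond, vocabulary

def classify_nb_text(doc, prior, cond, vocabulary):
    # pick the target value maximizing log P(v_j) + sum_k log P(w_k|v_j)
    def score(v):
        return math.log(prior[v]) + sum(math.log(cond[v][w])
                                        for w in doc if w in vocabulary)
    return max(prior, key=score)

A quick check that occurrence counts matter here:

>>> ex = [('word1 word1 word2'.split(), 'spam'),
...       ('word1 word3'.split(), 'ham')]
>>> prior, cond, vocab = learn_nb_text(ex)
>>> cond['spam']['word1'], cond['ham']['word1']
(0.5, 0.4)

The duplicated word1 in the spam document raises P(word1|spam), which is 
exactly the behaviour I expected from spambayes.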



